arXiv:1509.01746v3 [cs.DC] 16 Jul 2016


An Optimal Single-Path Routing Algorithm 
in the Datacenter Network DPillar 


Alejandro Erickson, Iain A. Stewart 
School of Engineering and Computing Sciences, Durham University, 
South Road, Durham DH1 3LE, U.K.

Abbas Eslami Kiasari, Javier Navaridas 
School of Computer Science, University of Manchester, 
Oxford Road, Manchester M13 9PL, U.K. 


Abstract 

DPillar has recently been proposed as a server-centric datacenter network and is combinatorially 
related to (but distinct from) the well-known wrapped butterfly network. We explain the relationship 
between DPillar and the wrapped butterfly network before proving that the underlying graph of DPillar 
is a Cayley graph; hence, the datacenter network DPillar is node-symmetric. We use this symmetry 
property to establish a single-path routing algorithm for DPillar that computes a shortest path and 
has time complexity O(k), where k parameterizes the dimension of DPillar (we refer to the number of
ports in its switches as n). Our analysis also enables us to calculate the diameter of DPillar exactly. 
Moreover, our algorithm is trivial to implement, being essentially a conditional clause of numeric tests, 
and improves significantly upon a routing algorithm earlier employed for DPillar. Furthermore, we 
provide empirical data in order to demonstrate this improvement. In particular, we empirically show that 
our routing algorithm improves the average length of paths found, the aggregate bottleneck throughput, 
and the communication latency. A secondary, yet important, effect of our work is that it emphasises that 
datacenter networks are amenable to a closer combinatorial scrutiny that can significantly improve their 
computational efficiency and performance. 
1 Introduction

Datacenters are assuming an increasingly important role in the global computational infrastructure. They 
provide platforms for a wide range of data-intensive applications and activities including web search, social 
networking, online gaming, large-scale scientific deployments and service-oriented cloud computing. There 
is an increasing demand that datacenters incorporate more and more servers, and do so in a cost-effective 
fashion, but still so that the resulting platform is computationally efficient (in various senses of the term). 

A datacenter network (DCN) comprises the physical communication infrastructure underpinning a
datacenter. One of the main aspects of a datacenter network is the topology by which the servers, switches
and other components of the datacenter are interconnected; the choice of topology strongly influences the
datacenter’s practical performance (see, e.g., [IH]). For simplicity, henceforth by DCN we refer to the
datacenter network topology. Originally, DCNs were hierarchical with expensive core routers that became
bottlenecks in terms of both performance and cost. They evolved into tree-like, switch-centric DCNs, built
from commodity-off-the-shelf (COTS) components; that is, so that the servers are located at the ‘leaves’ of
a tree-like structure that is composed entirely of switches and where the routing intelligence resides within
the switches. Such DCNs can offer better load balancing capabilities and so are less prone to bottlenecks
but have limited scalability due to (the size of) routing tables within the switches. Typical examples of such
switch-centric DCNs are ElasticTree [Tl], Fat-Tree [5], VL2 [TU], HyperX [3], Portland [50] and Flattened
Butterfly [1].

Alternative architectures have recently emerged and server-centric DCNs have been proposed whereby the 
interconnection intelligence resides within the servers as opposed to the switches. Now, switches only operate 
as dumb crossbars (and consequently the need for high-end switches is diminished as are the infrastructure 
costs). This paradigm shift means that more scalable topologies can be designed and the fact that routing 
resides within servers, which are easier to program than are switches, means that more effective routing 
algorithms can be adopted. However, server-centric DCNs are not a panacea as packet latency can increase, 
with the need to handle routing providing a computational overhead on the server. Typical examples of 
server-centric DCNs are DCell [T5], BCube [T3], FiConn [TB], CamCube |5], MCube [53], DPillar [18], HCN
and BCN [TT] and SWCube, SWKautz, and SWdBruijn [17]. An additional positive aspect of some
server-centric DCNs is that not only can commodity switches be used to build the datacenters but commodity
servers can too; the DCNs FiConn, MCube, DPillar, HCN, BCN, SWCube, SWKautz, and SWdBruijn are 
all such that any server only needs two NIC ports (the norm in commodity servers) in order to incorporate 
it into the DCN. 

It is with the DCN DPillar that we are concerned here. DPillar is an established and one of the most
promising benchmark dual-port server-centric DCNs. Moreover, DPillar is one of the even fewer
dual-port server-centric DCNs for which no server-node is adjacent to any other server-node, the others being
SWKautz, SWCube, and SWdBruijn. DPillar has recently been compared with other dual-port
server-centric DCNs [17]. It was shown that when the diameter of the DCN is normalized, DPillar can incorporate
more servers than FiConn and BCN, a similar number of servers to SWCube, and (usually) fewer servers
than SWKautz and SWdBruijn. However, DPillar, SWCube, SWKautz, and SWdBruijn were shown to
have similar bisection widths and all have better bisection widths than FiConn and BCN. Whilst SWCube,
SWKautz, and SWdBruijn were compared with each other in [17] with regard to aspects of routing in relation
to fault-tolerance and handling congestion, there was no comparison of these three DCNs with DPillar. Such
an evaluation is currently missing and would obviously be tied to a particular routing algorithm for DPillar,
an observation that we will return to in a moment.

As we shall see, DPillar is essentially obtained by replacing complete bipartite subgraphs
$K_{\frac{n}{2},\frac{n}{2}}$ in a wrapped butterfly network (see, e.g., [15]) with a switch with n ports.
In [18], basic properties of DPillar
are demonstrated and single-path and multi-path routing algorithms are developed (along with a forwarding
methodology for the latter). Our focus here is on single-path routing (also known as single-source
deterministic routing). The algorithm in [18] is appealing in its simplicity but for most source-destination
pairs it does not produce a path of shortest length; indeed, there is often a significant discrepancy between
the lengths of the path produced by the algorithm in [18] and a shortest path (as we demonstrate later). We remedy
this situation and develop a single-path routing algorithm that always outputs a shortest path. Although
the proof of correctness of our algorithm is non-trivial, the actual algorithm itself is a very simple sequence
of numeric tests and has the same time complexity as the original single path routing algorithm, i.e., linear
in the number of columns within DPillar.

Furthermore, we undertake an empirical evaluation and show that according to our experiments, the 
original single path routing algorithm for DPillar from [18] fails to provide a shortest path route for more
than 51% and up to 78% of the server pairs; this translates into our algorithm giving an improvement in the 
range of 20-30% in terms of the average path length derived. Note that a reduction in path length not only 
means that the latency of the network traffic will be reduced (between 20 and 25%, in our experiments), 
but also that as fewer resources are required for transmitting data, the overall throughput of the network
should also increase. To verify this latter contention, we empirically measure the aggregate bottleneck 
throughput (the most widely accepted datacenter throughput metric) for both algorithms and we find that 
our algorithm yields improvements in the range of 25-120%, with a mean of 65% and a median of 75%. The 
substantial improvements in average path length and throughput, together with the algorithmic simplicity of 
our proposal, more than motivate its utilization in production systems. As by-products of the development
of our algorithm, we prove that the DCN DPillar is, in essence, a Cayley graph, and thus node-symmetric 
(that is, there is an automorphism mapping any server to any other server), and we obtain the diameter of 
the DCN DPillar exactly.

Let us now return to our earlier remark as regards the current lack of a comparison in the literature of 
DPillar with SWCube, SWKautz, and SWdBruijn with respect to aspects of routing in relation to
fault-tolerance and handling congestion. Were we to embark on this comparison prior to the results of our paper
then we would be doing a disservice to DPillar as we would be working with the routing algorithm from [18]
which we prove (and empirically validate) here to be significantly worse in all respects than the routing 
algorithm we develop in this paper. We intend in future to undertake an extensive evaluation of aspects 
of routing for dual-port server-centric DCNs including DPillar, SWCube, SWKautz, and SWdBruijn but 
thanks to the results of this paper, this will now be with respect to our improved routing algorithm for 
DPillar (of course, such an evaluation is beyond the scope of this paper). 

In the next sections, we give an explicit definition of the DCN DPillar, both algebraically and as a 
derivation from wrapped butterfly networks, before showing how to abstract DPillar as a directed graph and 
proving that the resulting directed graph is a Cayley graph; an immediate consequence is that the DCN 
DPillar is node-symmetric. In Section 4, and using the newfound property of node-symmetry, we explain how
solving the single-path routing problem in our abstraction of DPillar can be further abstracted so that it is
equivalent to a routing problem in what we call a marked cycle, and in Section 5 we prove that shortest paths
in this marked cycle must have severe restrictions on their structure. We use these restrictions to develop our
single-path routing algorithm for DPillar in Section 6 and establish its correctness and its time complexity.
To support our theoretical analysis, we provide empirical evidence that the length of the (shortest) path
obtained by our single-path routing algorithm is significantly shorter than the length of the path obtained by
the single-path routing algorithm from [18] for many source-destination pairs, and we calculate the diameter
of DPillar explicitly. Our conclusions and directions for further research are given in the final section.

2 The DCN DPillar 

In this section, we explicitly define the DCN DPillar and explain how the DCN DPillar can be (informally) 
constructed from a wrapped butterfly network. 

2.1 A definition of DPillar 

The DCN DPillar [18] consists of a collection of switches, each of which has n ports, with n > 2 even, and
a collection of servers, each of which has 2 NIC ports. The names of the servers are
$\{(c, v_{k-1}v_{k-2} \cdots v_0) : 0 \le c \le k-1;\ 0 \le v_i \le \frac{n}{2}-1;\ 0 \le i \le k-1\}$
where k > 2 (we refer to k as the dimension); the first
parameter, c, is the column-index and denotes the column in which the server resides, whilst the second
parameter $v_{k-1}v_{k-2} \cdots v_0$ is the row-index and denotes the server’s position within a column (from the left,
the bit positions are $k-1, k-2, \ldots, 0$; note that we refer to the values as ‘bits’ and their positions as ‘bit’
positions). We denote the DCN DPillar with parameters n and k, as above, by DPillar$_{n,k}$. Consequently,
DPillar$_{n,k}$ has $k(\frac{n}{2})^k$ servers.
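
As an illustration of the naming scheme, the following minimal Python sketch enumerates the server names of DPillar$_{n,k}$ and can be used to confirm the count $k(\frac{n}{2})^k$. The function names `servers` and `format_server`, and the choice to store a row-index as a tuple `row` with `row[i]` holding bit $v_i$, are our own assumptions made for illustration; they do not come from [18].

```python
from itertools import product


def servers(n, k):
    """Yield every server name (c, row) of DPillar_{n,k}.

    row is a tuple indexed by bit position, so row[i] is the bit v_i.
    Within a column, rows are produced in increasing lexicographic
    order of the written form v_{k-1} ... v_0.
    """
    if n % 2 != 0:
        raise ValueError("n must be even")
    half = n // 2
    for c in range(k):
        for written in product(range(half), repeat=k):
            # 'written' lists the bits from the left (v_{k-1} first);
            # reverse it so that index i addresses bit position i.
            yield (c, written[::-1])


def format_server(server):
    """Render a server as (c,v_{k-1}...v_0); assumes single-digit bits."""
    c, row = server
    return f"({c},{''.join(str(v) for v in reversed(row))})"
```

For n = 6 and k = 4 the first rows produced in column 0 are 0000, 0001, 0002, 0010, ..., in agreement with the lexicographic order used in the text that follows.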

We term the collections of servers ‘columns’ as we visualize the servers within a column as being stacked
vertically within that column, with the row-indices of the servers, from top to bottom, being given in
increasing lexicographic order on $\{0, 1, \ldots, \frac{n}{2}-1\}^k$; so, if n = 6 and k = 4, for example, then the ordering
is given by 0000 < 0001 < 0002 < 0010 < 0011 < 0012 < 0020 < ... and so on. There are $(\frac{n}{2})^{k-1}$ switches
located between column i and column i + 1, for $i = 0, 1, \ldots, k-2$, and also between column k − 1 and column
0; thus, there are $k(\frac{n}{2})^{k-1}$ switches in DPillar$_{n,k}$. We think of the switches between two columns of servers
as appearing in a column too, with the names of the switches in a column being $\{0, 1, \ldots, \frac{n}{2}-1\}^{k-1}$ and
again stacked from top to bottom in increasing lexicographic order. If a switch lies between server-column c
and server-column c + 1, where $c \in \{0, 1, \ldots, k-1\}$ and addition is modulo k, then we say that its column is
column c (henceforth, we assume that addition and subtraction on the names of columns are always modulo
k). The columns of servers and switches for DPillar$_{6,3}$ can be visualized as in Fig. 1 (note that the servers
in the right-most and left-most columns are identical but are shown separately to facilitate visualization).

Figure 1: Visualizing DPillar$_{6,3}$.

*Some results from this paper appeared in preliminary form in: A. Erickson, A. Kiasari, J. Navaridas and I.A. Stewart, An
efficient shortest path routing algorithm in the data centre network DPillar, Proc. of 9th Ann. Int. Conf. on Combinatorial
Optimization and Applications, 2015, pp. 209-220; some proofs and results were omitted and there was no experimental
evaluation.

All links are server-switch links and are from a server in (server-)column c to a switch in (switch-)column
c or from a server in (server-)column c + 1 to a switch in (switch-)column c (where $c \in \{0, 1, \ldots, k-1\}$).
Let $(c, v_{k-1}v_{k-2} \cdots v_0)$ be a server in column c. The switch to which it is connected in column c is the
switch named $v_{k-1} \cdots v_{c+1}v_{c-1} \cdots v_0$. If $(c+1, v_{k-1}v_{k-2} \cdots v_0)$ is a server in column c + 1 then the switch
to which it is connected in column c is the switch named $v_{k-1} \cdots v_{c+1}v_{c-1} \cdots v_0$. So, for example, the server
$(c, v_{k-1} \cdots v_{c+1} * v_{c-1} \cdots v_0)$, where * denotes that we may substitute in any number from $\{0, 1, \ldots, \frac{n}{2}-1\}$,
is connected to the switch $v_{k-1} \cdots v_{c+1}v_{c-1} \cdots v_0$ in column c, which in turn is connected to the server
$(c+1, v_{k-1} \cdots v_{c+1} * v_{c-1} \cdots v_0)$. Similarly, the server $(c, v_{k-1} \cdots v_c * v_{c-2} \cdots v_0)$ is connected to the switch
$v_{k-1} \cdots v_c v_{c-2} \cdots v_0$ in column c − 1, which in turn is connected to the server $(c-1, v_{k-1} \cdots v_c * v_{c-2} \cdots v_0)$.
The server-switch links for DPillar$_{6,3}$ can be visualized as in Fig. 1.
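
The wiring rule above can be checked mechanically. The sketch below (our own illustration, with hypothetical helper names `right_switch`, `left_switch` and `ports_in_use`) computes the two switches adjacent to a server by deleting the appropriate bit from its row-index, and counts how many servers attach to each switch. A row-index is again assumed to be a tuple indexed by bit position, and a switch is represented as a pair (switch-column, name with one bit deleted).

```python
from collections import Counter
from itertools import product


def servers(n, k):
    """All servers (c, row) of DPillar_{n,k}; row[i] is the bit v_i."""
    half = n // 2
    return [(c, row) for c in range(k) for row in product(range(half), repeat=k)]


def right_switch(server):
    """Switch in switch-column c joined to server (c, row): delete bit v_c."""
    c, row = server
    return (c, row[:c] + row[c + 1:])


def left_switch(server):
    """Switch in switch-column c-1 (mod k) joined to server (c, row): delete bit v_{c-1}."""
    c, row = server
    j = (c - 1) % len(row)
    return (j, row[:j] + row[j + 1:])


def ports_in_use(n, k):
    """Map each switch to the number of servers linked to it."""
    ports = Counter()
    for s in servers(n, k):
        ports[right_switch(s)] += 1
        ports[left_switch(s)] += 1
    return ports
```

Under these assumptions, every switch of DPillar$_{6,3}$ should be found to have exactly n = 6 ports in use, and the number of distinct switches should equal $k(\frac{n}{2})^{k-1} = 27$.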

An alternative informal definition of DPillar$_{n,k}$ can be given. With reference to Fig. 1, we can replace
every switch with a complete bipartite graph $K_{\frac{n}{2},\frac{n}{2}}$ (the bipartition is the obvious one). What results is
the well-known wrapped butterfly network (see, e.g., [15]; this network has been well-studied within the
context of multiprocessor systems). The primary difference between DPillar$_{n,k}$ and the resulting wrapped
butterfly network is that a switch in DPillar$_{n,k}$ enables direct server-to-server communication between servers
connected to the same switch and in the same column, whereas such communication is absent in the wrapped
butterfly network.

2.2 Abstracting DPillar 

We can abstract DPillar$_{n,k}$ as a digraph as follows: the nodes of this graph are the servers of DPillar$_{n,k}$; and
there is an edge from a source-node to a target-node if there is a link from the corresponding source-server
to a switch and a link from that switch to the corresponding target-server (so, the edges correspond to
server-switch-server paths). There are 4 types of edges in the digraph abstracting DPillar$_{n,k}$:

(i) clockwise edges (c-edges) which are edges of the form

$((c, v_{k-1} \cdots v_{c+1}v_c v_{c-1} \cdots v_0), (c+1, v_{k-1} \cdots v_{c+1} * v_{c-1} \cdots v_0))$

(ii) anti-clockwise edges (a-edges) which are edges of the form

$((c, v_{k-1} \cdots v_c v_{c-1}v_{c-2} \cdots v_0), (c-1, v_{k-1} \cdots v_c * v_{c-2} \cdots v_0))$

(iii) basic static edges (b-edges) which are edges of the form

$((c, v_{k-1} \cdots v_{c+1}v_c v_{c-1} \cdots v_0), (c, v_{k-1} \cdots v_{c+1} * v_{c-1} \cdots v_0))$

(iv) decremented static edges (d-edges) which are edges of the form

$((c, v_{k-1} \cdots v_c v_{c-1}v_{c-2} \cdots v_0), (c, v_{k-1} \cdots v_c * v_{c-2} \cdots v_0))$.

So, within our abstraction of DPillar$_{n,k}$ as a digraph, the nodes are the servers and are located in columns
$0, 1, \ldots, k-1$ (as before) with all edges joining nodes in consecutive columns (clockwise and anti-clockwise
edges) or nodes in the same column (static edges). In fact, our digraph (where each node has in- and
out-degree $2n-2$) can also be thought of as an undirected graph (that is regular of degree $2n-2$) as all edges
come in oppositely oriented pairs. Note that the clockwise (resp. anti-clockwise, basic static, decremented
static) edge above corresponds to a server-switch-server path in the DCN DPillar$_{n,k}$ from a column c server
through a column c (resp. c − 1, c, c − 1) switch and on to a column c + 1 (resp. c − 1, c, c) server. Henceforth,
we denote the digraph abstracting DPillar$_{n,k}$ by DPillar$_{n,k}$ too (this causes no confusion). The abstraction
of DPillar can be visualized as in Fig. 1 where we show how the switch 00 in column 2 gives rise to a set of
edges in the abstraction of DPillar as a graph. We annotate edges as follows: an edge annotated ‘a’ is an
anti-clockwise edge relative to the node (0,000) (the arrow on the edge from (0,000) denotes that the label
is with respect to (0,000)); an edge annotated ‘b’ is a basic static edge relative to node (2,000); an edge
annotated ‘c’ is a clockwise edge relative to node (2,000); and an edge annotated ‘d’ is a decremented static
edge relative to node (0,000) (so, an edge has two labels: one relative to one incident node; and another
relative to the other incident node). In short, for some node, the adjacent switch ‘to the right’ gives rise to
b-edges and c-edges, and the one ‘to the left’ gives rise to a-edges and d-edges.
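
The four edge types can be generated directly from the definitions. The following Python sketch is our own illustration (the generator name `edges` and the tuple representation of row-indices, with `row[i]` holding bit $v_i$, are assumptions); it yields every labelled edge leaving a node, so that the out-degree $2n-2$ and the pairing of oppositely oriented edges can be checked on small instances.

```python
def edges(server, n):
    """Yield (label, target) for every edge leaving server = (c, row) in DPillar_{n,k}.

    Labels are 'c' (clockwise), 'a' (anti-clockwise), 'b' (basic static)
    and 'd' (decremented static). row[i] is the bit v_i and k = len(row).
    """
    c, row = server
    k = len(row)
    half = n // 2
    prev = (c - 1) % k
    for x in range(half):
        # Through the switch 'to the right': bit v_c may be replaced by x.
        via_right = row[:c] + (x,) + row[c + 1:]
        # Through the switch 'to the left': bit v_{c-1} may be replaced by x.
        via_left = row[:prev] + (x,) + row[prev + 1:]
        yield ("c", ((c + 1) % k, via_right))
        yield ("a", (prev, via_left))
        # Static edges must change the bit, otherwise they would be self-loops.
        if x != row[c]:
            yield ("b", (c, via_right))
        if x != row[prev]:
            yield ("d", (c, via_left))
```

Each node contributes $\frac{n}{2}$ c-edges, $\frac{n}{2}$ a-edges, $\frac{n}{2}-1$ b-edges and $\frac{n}{2}-1$ d-edges, which is where the degree $2n-2$ comes from.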

3 DPillar is a Cayley Graph 

In this section, we prove that the digraph DPillar$_{n,k}$ is a Cayley graph, and consequently node-symmetric
(we exploit this node-symmetry later on in our single-path routing algorithm and in our experimental work).
Recall that a graph is a Cayley graph if the nodes can be labelled with the elements of an (algebraic) group
G and there is a generating subset $S \subseteq G$ that is closed under inverses so that every directed edge (u, v) is
labelled with an element $s \in S$ if, and only if, us = v (within the group G). We say that a digraph is
node-symmetric if given any 2 distinct nodes src and dst, there is an automorphism (that is, a one-to-one
mapping $\varphi$ of the node-set onto itself such that if (u, v) is an edge then $(\varphi(u), \varphi(v))$ is an edge) mapping
src to dst. It is well-known, and trivial to prove, that every Cayley graph is node-symmetric. The first
paper to establish that being a Cayley graph is a useful property for an interconnection network is [4] and 
since then, there has been much research into representing interconnection networks using finite groups. Not 
only do we immediately obtain that any Cayley graph is node-symmetric (which is a fundamental property 
of interconnection networks 0) but Cayley graphs have been shown to be relevant to various networks in 
a variety of ways; for example, with regard to the design of interconnection networks by pruning nodes 
and edges from tori [^, the design of wireless DCNs [22], and the design of high-dimensional mesh-based 
interconnection networks |6]. 
3.1 DPillar Symmetry

Whilst it was stated in [18] that the DCN DPillar is ‘symmetric’, it was not stated as to what ‘symmetric’
meant (hence, there was no proof of ‘symmetry’). Our main intention is to show that DPillar is node- 
symmetric (defined above) but we do this by proving that DPillar is a Cayley graph. 

Lemma 1. The digraph DPillar$_{n,k}$ is a Cayley graph.

Proof. Our proof is related to the proof in [9] that the wrapped butterfly network (called the cyclic cube 
in [5]) is a Cayley graph. The full proof can be found in the supplemental material. □ 

We obtain the immediate corollary. 

Corollary 2. The digraph DPillar$_{n,k}$ is node-symmetric.

4 Abstracting routing in DPillar 

In this section, we abstract the problem of finding a path in the digraph DPillar$_{n,k}$ from a given source-node
to a given destination-node so that ultimately this problem is equivalent to finding a path from a source-node
to a destination-node in a cycle of length k but where the actual node-to-node moves are more complicated
than in a digraph. We also explain the single-path routing algorithm from [18].

4.1 Fixing bits 

It is important to appreciate what might be accomplished by moving along one of the 4 different types
of edge highlighted above. Suppose that we are attempting to move from some source-node src to some
destination-node dst within DPillar$_{n,k}$ and that we are currently at some node in column c. We can choose
a clockwise (resp. anti-clockwise, basic static, decremented static) edge so as to set the cth (resp. (c − 1)th,
cth, (c − 1)th) bit in the row-index to whatever value from $\{0, 1, \ldots, \frac{n}{2}-1\}$ that we like. Consequently,
by choosing a clockwise (resp. anti-clockwise, basic static, decremented static) edge along which to move,
we can ‘fix’ the cth (resp. (c − 1)th, cth, (c − 1)th) bit of the row-index so that it matches that of the
destination-node. We say that: a clockwise edge covers the column in which its source-node lies; an
anti-clockwise edge covers the column in which its target-node lies; a basic static edge covers the column in which
both its source- and target-nodes lie; and a decremented static edge covers the column that is adjacent in
an anti-clockwise direction to the column in which both its source- and target-nodes lie. Thus, if we wish to
move along some path from src to dst then we need to ensure that we move from column to column so as
to fix all of the bits of the row-index that need fixing, but so that we don’t subsequently ‘unfix’ them, and
so that we end up in the column within which dst resides (with regard to not ‘unfixing’ a bit, note that we
can always move from a node in one column to a node in an adjacent column so that the row-index remains
unchanged). This is equivalent to moving from column to column so that every row-index bit-position, i.e.,
column, where the bit values of src and dst differ is necessarily covered by some edge and so that we end up
in the column within which dst resides. If we are looking for a shortest path from src to dst then we have
to do this using as few moves as possible. Of course, any path of length l in our abstraction of DPillar$_{n,k}$
as a digraph translates to a path consisting of l server-switch-server link-pairs in the DCN DPillar$_{n,k}$ and
vice versa (for the sake of uniformity, we measure the length of server-to-server paths in the DCN DPillar in
terms of the number of server-switch-server link-pairs in the path; this is also common practice in the DCN
community).

As an illustration, suppose we are at (1,12530) in DPillar$_{12,5}$ and wish to get to the destination (4,54314).
If x denotes any element of {0, 1, 2, 3, 4, 5}, there is: an anti-clockwise edge taking us to (0,1253x); a basic
static edge taking us to (1,125x0); a clockwise edge taking us to (2,125x0); and a decremented static
edge taking us to (1,1253x). Given our destination, when we move we can choose x accordingly and fix
the appropriate bit so that we move: via an anti-clockwise edge to (0,12534); via a basic static edge to
(1,12510); via a clockwise edge to (2,12510); or via a decremented static edge to (1,12534).
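
The illustration can be reproduced with a few lines of Python. This is a sketch under our own assumptions: the helpers `parse`, `show` and `fixing_moves` are hypothetical names, bits are assumed to be single decimal digits, and a row-index is stored as a tuple with `row[i]` holding bit $v_i$.

```python
def parse(column, digits):
    """Build a server from its column and its row-index written as v_{k-1}...v_0."""
    return (column, tuple(int(ch) for ch in reversed(digits)))


def show(server):
    """Inverse of parse: render a server as (c,v_{k-1}...v_0)."""
    c, row = server
    return f"({c},{''.join(str(v) for v in reversed(row))})"


def fixing_moves(current, dst):
    """For each edge type, the node reached when the covered bit is set to dst's value.

    If the covered bit already agrees with dst, the 'b' or 'd' entry equals
    the current node; no static edge exists in that case.
    """
    c, row = current
    target = dst[1]
    k = len(row)
    prev = (c - 1) % k
    fix_c = row[:c] + (target[c],) + row[c + 1:]              # bit c fixed
    fix_prev = row[:prev] + (target[prev],) + row[prev + 1:]  # bit c-1 fixed
    return {
        "a": (prev, fix_prev),
        "b": (c, fix_c),
        "c": ((c + 1) % k, fix_c),
        "d": (c, fix_prev),
    }
```

Calling `fixing_moves(parse(1, "12530"), parse(4, "54314"))` and rendering the results with `show` gives the four nodes listed in the illustration.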
4.2 Another abstraction


A crucial observation arising from the above discussion is that when routing in DPillar$_{n,k}$, the actual value
of some bit in a row-index of some node is unimportant: what matters is whether this value is equal to or
different from the value of the corresponding bit in the row-index of the destination-node (that is, whether
the bit needs to be ‘fixed’ or not). Consequently, in order to solve the problem of finding a path from src,
which lies in column src', to dst, which lies in column dst', in DPillar$_{n,k}$, we can abstract the problem as a
(more involved) routing problem in the following digraph $G_{n,k}(src', dst')$:

• we think of there being one node for each of the k columns of nodes of DPillar$_{n,k}$ with nodes in
$G_{n,k}(src', dst')$ that correspond to adjacent columns being joined by an oppositely oriented pair of
edges (so, we can also think of $G_{n,k}(src', dst')$ as an undirected cycle of length k)

• we mark every node c, corresponding to some column c (or, alternatively, some bit-position c in the
row-index of some node of DPillar$_{n,k}$) that needs to be covered (because bit c of the row-index of src
is different from bit c of the row-index of dst), with the set of marked nodes being denoted by B

• we move from node to node in $G_{n,k}(src', dst')$, starting at the node src' so as to end at the node dst'
and making moves where:

(i) a c-move means we move from node c to node c + 1 and such a move covers node c

(ii) an a-move means we move from node c to node c − 1 and such a move covers node c − 1

(iii) a b-move means we stay at node c and such a move covers node c

(iv) a d-move means we stay at node c and such a move covers node c − 1

(note the correspondence between the above moves and the edge types given in Section 2.2). We call
$G_{n,k}(src', dst')$ a marked cycle. Note that it might be the case that src' = dst' in $G_{n,k}(src', dst')$ (this would
mean that the nodes src and dst lie in the same column in DPillar$_{n,k}$).

With regard to our illustration in the previous section, the edge from (1,12530): to (0,12534) results 
in an a-move covering node 0 in the marked cycle; to (1,12510) results in a b-move covering node 1 in the 
marked cycle; to (2,12510) results in a c-move covering node 1 in the marked cycle; and to (1,12534) results 
in a d-move covering node 0 in the marked cycle. 

It should be clear as to how moves in the marked cycle $G_{n,k}(src', dst')$ correspond to moves along
corresponding edges in DPillar$_{n,k}$ (and so to server-switch-server link-pairs in the DCN DPillar$_{n,k}$) with the
coverage of a node in $G_{n,k}(src', dst')$ and a node of DPillar$_{n,k}$ being in direct correspondence. A path in
$G_{n,k}(src', dst')$ is a sequence of moves leading from src' to dst' and corresponds to a path in DPillar$_{n,k}$ from
node src to node dst (and vice versa) with the lengths of the two paths being identical. Consequently, in
order to find a shortest path from src to dst in the DCN DPillar$_{n,k}$, it suffices to find a shortest path in the
marked cycle $G_{n,k}(src', dst')$ (from the node src' to the node dst') so that every marked node is covered by
a move. Note that if src' = dst' then the empty sequence of moves does not constitute a legitimate path.
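
The marked-cycle abstraction is straightforward to simulate. The sketch below is our own illustration with hypothetical function names: `marked_nodes` computes the set B from a source and destination (row-indices stored as tuples with `row[i]` holding bit $v_i$), `run_moves` applies a string of moves over {a, b, c, d}, and `is_path` checks the three requirements stated above (non-empty, ends at dst', covers every marked node).

```python
def marked_nodes(src, dst):
    """The set B of bit positions (columns) where src and dst differ."""
    return {i for i, (s, d) in enumerate(zip(src[1], dst[1])) if s != d}


def run_moves(moves, start, k):
    """Apply a string over {a, b, c, d} from node 'start' on a cycle of length k.

    Returns (end node, set of covered nodes).
    """
    node, covered = start, set()
    for m in moves:
        if m == "c":                 # covers the node it leaves
            covered.add(node)
            node = (node + 1) % k
        elif m == "a":               # covers the node it enters
            node = (node - 1) % k
            covered.add(node)
        elif m == "b":               # stays put, covers the current node
            covered.add(node)
        elif m == "d":               # stays put, covers the anti-clockwise neighbour
            covered.add((node - 1) % k)
        else:
            raise ValueError(f"unknown move {m!r}")
    return node, covered


def is_path(moves, src, dst):
    """True if 'moves' is a legitimate path from src' to dst' covering all of B."""
    k = len(src[1])
    end, covered = run_moves(moves, src[0], k)
    return len(moves) > 0 and end == dst[0] and marked_nodes(src, dst) <= covered
```

For instance, with src = (0, 0000) and dst = (1, 1000) the only marked node is 3, and the sequence dc is a legitimate path whereas the single move c is not.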


4.3 Basic routing in DPillar 

Before we continue, let us discuss the single-path routing algorithm for DPillar as detailed in [18]; we refer
to this algorithm as DPillarSP. The routing algorithm DPillarSP operates in 2 phases: in the first phase (the
so-called ‘helix’ phase), a path in the DCN DPillar$_{n,k}$ is chosen so that movement is always in a clockwise
direction (that is, the column-index is always incremented) or always in an anti-clockwise direction (that is,
the column-index is always decremented) in order that the row-index is ‘fixed’ so that it is identical to that
of the destination-node; and in the second phase (the so-called ‘ring’ phase), a path is subsequently chosen
so as to reach the destination-node without amending the row-index and so that movement is in the same
direction as in the first phase. Although not explicitly mentioned when discussing their algorithm, it is clear
that the time complexity of the single-path routing algorithm from [18] is O(k) (we have suppressed the log n
component required to represent each bit-value).
It is stated in [18, Section 3.1] that this single-direction movement is so that ‘loops’ might be avoided.
While this statement was not explained further, it is probable that what was meant by ‘loops’ was a loop
within a single route for a source-destination pair. Of course, our shortest-path routing algorithm means that
loops in a single path will never occur. Alternatively (though unlikely), the rationale for the decision in [18]
to restrict to single-direction movement might have been to avoid either network-level deadlock or livelock
due to dependency loops (see, e.g., [3 Ch. 14]). Irrespective of the intentions in [18], it is worth commenting
on the potential for deadlocks in DPillar and server-centric DCNs in general. Given that the topology of
DPillar is basically a sophisticated ring of columns, moving in a single direction does not completely prevent
dependency loops from appearing. We give an example in Fig. 2 where there is a (bold) route from (0,000)
to (2,200) and a (dotted) route from (1,200) to (1,000) so that there is a cyclic dependency graph, due to the
shared switches (0,00) and (1,20), even though we are using single-direction routing. Nevertheless, there are
many reasons to believe that, in the context of server-centric DCNs based on COTS hardware and software
(i.e., Ethernet hardware and TCP/IP stack), network level deadlocks should be a minor concern. First,
commodity Ethernet hardware uses packet-switching which prevents network frames from spreading across
many network components; therefore a cyclic dependency between frames is unlikely to happen. Second,
servers have virtually unlimited memory (and indeed, many orders of magnitude more than switches); hence
we can assume infinite FIFOs at the servers. Considering that one of the necessary conditions for deadlocks
to appear is for FIFOs to become full, it is, again, very unlikely that we end up in a deadlock situation.
Finally, in the very unlikely situation of a cyclic dependency appearing and all the FIFOs becoming full, the
packet-dropping mechanism of Ethernet-based hardware provides seamless deadlock recovery, whereas TCP
ensures data delivery. The upshot is that deadlocks are not a primary concern in DCNs.


Figure 2: A dependency loop between two routes with DPillarSP. Server (0,000) sends to (2,200) and server
(1,200) sends to (1,000). The paths through switches (0,00) and (1,20) are conflicted.

It is very easy to see (by looking at some typical source-destination examples) that the routing algorithm
DPillarSP is by no means optimal and that more often than not much shorter paths exist (an upper bound of
$2k-1$ on the lengths of paths produced was stated in [18]). For example, if one chooses to route in a clockwise
fashion in DPillar$_{n,k}$ with the source (0, 00...0) and the destination (1, 10...0) then DPillarSP yields a
path of length $k+1$, and if one routes in an anti-clockwise fashion then the algorithm yields a path of
length $k-1$; however, a shortest path has length 2 (a d-move followed by a c-move). Our contention is that
by relaxing this insistence on single-direction movement, we can obtain a much improved routing algorithm;
indeed, as we shall see, we develop an optimal single-path routing algorithm (where the implementation
overheads are negligible and where there are significant practical benefits).
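
The gap in this example can be confirmed by exhaustive search on the marked cycle. The sketch below is purely illustrative and rests on two assumptions of ours: `single_direction_length` is our reading of the two-phase (helix then ring) description of DPillarSP given above, not the code of [18]; and `shortest_length` is a brute-force breadth-first search over (node, covered marked nodes) states, which is exponential in the number of marked nodes and is not the O(k) algorithm developed in this paper.

```python
from collections import deque


def single_direction_length(src_col, dst_col, marked, k, clockwise=True):
    """Moves used when travelling in one direction only until every marked
    node is covered and the destination column is reached."""
    node, covered, steps = src_col, set(), 0
    while not (set(marked) <= covered and node == dst_col):
        if clockwise:
            covered.add(node)          # a c-move covers the node it leaves
            node = (node + 1) % k
        else:
            node = (node - 1) % k      # an a-move covers the node it enters
            covered.add(node)
        steps += 1
    return steps


def shortest_length(src_col, dst_col, marked, k):
    """Length of a shortest non-empty move sequence from src_col to dst_col
    that covers every marked node (brute-force BFS)."""
    marked = frozenset(marked)
    start = (src_col, frozenset())
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        (node, covered), dist = queue.popleft()
        prev = (node - 1) % k
        successors = [
            ((node + 1) % k, node),  # c-move: go clockwise, cover current node
            (prev, prev),            # a-move: go anti-clockwise, cover new node
            (node, node),            # b-move: stay, cover current node
            (node, prev),            # d-move: stay, cover anti-clockwise neighbour
        ]
        for nxt, cov in successors:
            new_cov = covered | ({cov} & marked)
            if nxt == dst_col and new_cov == marked:
                return dist + 1
            state = (nxt, new_cov)
            if state not in seen:
                seen.add(state)
                queue.append((state, dist + 1))
    raise AssertionError("unreachable: a covering path always exists")
```

With k = 6, source column 0, destination column 1 and the single marked node k − 1 = 5, the two single-direction lengths come out as k + 1 = 7 and k − 1 = 5, while the search returns 2.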

5 Routing in a marked cycle 

We begin by making some initial observations as regards routing along a shortest path (from src' to dst') in 
a marked cycle $G_{n,k}(src', dst')$ before proving that any such shortest path has a restricted structure. 

5.1 Some initial observations 

Henceforth, p is a shortest path from src' to dst' in $G_{n,k}(src', dst')$. Consider two consecutive moves in p. 
We can often rule out consecutive pairs of moves. For example, suppose that we have within p a c-move 
followed by an a-move. We can replace this pair within p by a b-move so as to obtain a path with identical 
coverage to p and which is shorter. This yields a contradiction. Similarly, suppose that we have an a-move 
followed by a c-move within p. We can replace this pair within p by a d-move so as to again obtain a 
contradiction. In Table 1 we detail all pairs of consecutive moves in p that are forbidden by including the 
substitution that would result in a shorter path that has equivalent coverage. In this table, the first move is 
detailed in the rows and the second move in the columns. A blank cell means that the corresponding pair 
of moves cannot immediately be ruled out. 


Table 1: Disallowed pairs of moves (rows: first move; columns: second move). 

            a-move    b-move    c-move    d-move
a-move                a-move    d-move
b-move                b-move    c-move
c-move      b-move                        c-move
d-move      a-move                        d-move
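
The substitutions in Table 1 can be re-derived mechanically. The following Python sketch assumes one concrete reading of the four moves on the marked cycle that is consistent with the substitutions stated above: a c-move covers the current node and then steps clockwise; an a-move steps anti-clockwise and covers the node it arrives at; a b-move covers the current node without moving; and a d-move covers the anti-clockwise neighbour without moving. The names `run` and `disallowed_pairs`, and the constants `K` and `START`, are illustrative choices and are not taken from the paper.

```python
from itertools import product

K = 7       # cycle length; any K >= 3 avoids wrap-around coincidences (assumption)
START = 3   # arbitrary starting column


def run(moves, k=K, start=START):
    """Apply a string of moves; return (final column, frozenset of covered nodes)."""
    pos, covered = start, set()
    for move in moves:
        if move == "c":        # cover the current node, then step clockwise
            covered.add(pos)
            pos = (pos + 1) % k
        elif move == "a":      # step anti-clockwise, covering the node reached
            pos = (pos - 1) % k
            covered.add(pos)
        elif move == "b":      # stay put, cover the current node
            covered.add(pos)
        elif move == "d":      # stay put, cover the anti-clockwise neighbour
            covered.add((pos - 1) % k)
        else:
            raise ValueError(f"unknown move {move!r}")
    return pos, frozenset(covered)


def disallowed_pairs():
    """Map each pair of moves to a single move with identical effect, if one exists."""
    table = {}
    for first, second in product("abcd", repeat=2):
        target = run(first + second)
        for single in "abcd":
            if run(single) == target:
                table[first + second] = single
    return table


if __name__ == "__main__":
    for pair, single in sorted(disallowed_pairs().items()):
        print(f"{pair} -> {single}")
```

Under these assumed semantics, exactly eight of the sixteen pairs collapse to a single move, and the pairs bd and db have the same effect as one another, which matches the remarks that follow the table.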


For clarity, rather than say, for example, ‘a c-move followed by an a-move’, in future we will simply write 
ca to denote this circumstance. Consequently, subsequences of moves within p will be written as strings 
over $\{a, b, c, d\}$ (as will p itself) and we compress subsequences of the same symbol, such as aaaa, by using 
powers, such as $a^4$. 

We can say more. If we have a subsequence of moves bd then this has the same effect as the subsequence 
db, and so we may suppose that a subsequence db within p is forbidden. Also, note that if p has length at 
least 3 then we cannot have a subsequence bd: 

• a subsequence bdb can be replaced by bd; a subsequence bdc can be replaced by dc; and we cannot have 
a subsequence da or dd (so bd cannot be followed by a further move) 

• a subsequence cbd can be replaced by cb; a subsequence dbd can be replaced by db; and we cannot have 
a subsequence ab or bb (so bd cannot be preceded by a move). 

Consequently, if p has length at least 3 then: 

• if a c-move is not the final move of p then it must be followed by another c-move or a b-move 

• if an a-move is not the final move of p then it must be followed by another a-move or a d-move 

• if a b-move is not the final (resp. first) move of p then it must be followed by an a-move (resp. preceded 
by a c-move) 

• if a d-move is not the final (resp. first) move of p then it must be followed by a c-move (resp. preceded 
by an a-move). 

Consequently, if p has length at least 3 then it must be of one of two forms: 

(1) possibly a d-move (but maybe not) followed by a sequence of c-moves followed by a b-move followed by 
a sequence of a-moves followed by a d-move followed by a sequence of c-moves followed by ... followed 
by a sequence of c-moves (resp. a-moves) possibly followed by a b-move (resp. d-move); that is, 

$d^{\epsilon} c^{i_1} b a^{j_1} d c^{i_2} b a^{j_2} \ldots c^{i_m} b^{\delta}$ or $d^{\epsilon} c^{i_1} b a^{j_1} d c^{i_2} \ldots a^{j_m} d^{\delta}$, 

for some $m \geq 1$, where $i_1, i_2, \ldots, j_1, j_2, \ldots \geq 1$ and where $\epsilon, \delta \in \{0, 1\}$ 

(2) possibly a b-move followed by a sequence of a-moves followed by a d-move followed by a sequence of 
c-moves followed by a b-move followed by a sequence of a-moves followed by ... followed by a sequence 
of a-moves (resp. c-moves) possibly followed by a d-move (resp. b-move); that is, 

$b^{\epsilon} a^{i_1} d c^{j_1} b a^{i_2} \ldots a^{i_m} d^{\delta}$ or $b^{\epsilon} a^{i_1} d c^{j_1} b a^{i_2} \ldots c^{j_m} b^{\delta}$, 

for some $m \geq 1$, where $i_1, i_2, \ldots, j_1, j_2, \ldots \geq 1$ and where $\epsilon, \delta \in \{0, 1\}$ 

(when we say ‘sequence’, above, we mean ‘non-empty sequence’). 
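
The two forms are regular languages over $\{a, b, c, d\}$, so membership can be tested with a pair of regular expressions. The sketch below is one possible encoding of the verbal descriptions of forms (1) and (2); the pattern names `FORM_1` and `FORM_2` and the helper `classify` are hypothetical and are introduced only for illustration.

```python
import re

# Form (1): optional d, then runs of c and runs of a that alternate,
# joined by a single b (c-run to a-run) or a single d (a-run to c-run),
# optionally closed by a final b (after a c-run) or a final d (after an a-run).
FORM_1 = re.compile(r"d?c+(?:ba+dc+)*(?:b|ba+d?)?")

# Form (2): the mirror image, with a <-> c and b <-> d exchanged.
FORM_2 = re.compile(r"b?a+(?:dc+ba+)*(?:d|dc+b?)?")


def classify(path):
    """Return 1 or 2 if `path` has form (1) or form (2), and None otherwise."""
    if FORM_1.fullmatch(path):
        return 1
    if FORM_2.fullmatch(path):
        return 2
    return None


if __name__ == "__main__":
    for example in ("cccbaadc", "baadccb", "dccbaad", "ccab"):
        print(example, classify(example))
```

A string that begins with an optional d followed by c can only match form (1), and one that begins with an optional b followed by a can only match form (2), so the two tests never overlap.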

5.2 Restricting the number of turns 

If we have a subsequence cba in p then we say that an anti-clockwise turn, or simply an a-turn, occurs at 
the b-move; similarly, if we have a subsequence adc then we say that a clockwise turn, or simply a c-turn, 
occurs at the d-move. Note that if we have an a-turn in p then the node at which this turn occurs, i.e., 
the node that is covered by the b-move, must be marked in $G_{n,k}(src', dst')$ as otherwise we could delete 
the corresponding b-move from p and still have a sequence from src' to dst' covering all the marked nodes, 
which would yield a contradiction. Similarly, if we have a c-turn then the node at which this c-turn occurs, 
i.e., the node that is covered by the d-move, must be marked. We will use these observations later; but now 
we prove that any shortest path p must contain at most 2 turns. 

Suppose that p is a shortest path and has at least 3 turns. What we do now is undertake a case by case 
analysis of the different configurations that might arise. These cases arise from the forms derived at the 
end of the previous subsection: the first two cases correspond to form (1) and the next two cases to form 
(2). The technique employed in each case is to modify the path p, by replacing sequences of moves within 
p, so as to obtain a new path that has the same coverage but is shorter; this contradicts the minimality of p 
and so refutes our assumption that p has at least 3 turns. 

Case (a): Suppose that p is of form (1) and has a prefix p' of the form $c^i b a^j d c^l b a$, where $i, j, l \geq 1$. 

By this we mean that p begins with i c-moves followed by a b-move followed by j a-moves followed by a 
d-move followed by l c-moves followed by a b-move followed by an a-move. 

If $j < i$ then we can replace the prefix $c^i b a^j d c$ in p' with $c^i b a^{j-1}$ and still obtain the same coverage; this 
contradicts that p is a shortest path (note that we have actually only assumed so far that p has 2 turns). If 
$j = i$ then we can replace the prefix $c^i b a^j d c$ in p' with $d c^i b a^{j-1}$ so as to obtain a contradiction (we have still 
actually only assumed that p has 2 turns). Hence, we must have that $j > i$. Suppose that $j \geq l > j - i$. We 
can replace the prefix $c^i b a^j d c^l$ in p' with $a^{j-i} d c^j b a^{j-l}$ so as to obtain a contradiction (we have still actually 
only assumed that p has 2 turns). Hence, $j > i$ and either $l \leq j - i$ or $l > j$. 

Suppose that $l > j$. We can replace the prefix $c^i b a^j d c^l$ in p' with $a^{j-i} d c^l$ so as to obtain a contradiction 
(we have still actually only assumed that p has 2 turns). Hence, we must have that $j > i$ and $l \leq j - i$. 
However, if we replace p' with $c^i b a^j d c^{l-1}$ then we obtain a contradiction (here we do use the fact that p has 
at least 3 turns). So, p has at most 2 turns and if it has 2 turns then p is of the form $c^i b a^j d c^l$ where $j > i$ 
and $l \leq j - i$. 

Figure 3: Visualizing paths with 2 turns. 


We can say more if p has 2 turns. Suppose that $j \geq k - 1$. The b-move can be deleted from p' and we 
obtain a contradiction. Hence, if p has 2 turns then p is of the form $c^i b a^j d c^l$ where $k - 1 > j > i \geq 1$ and 
$1 \leq l \leq j - i$. We can visualize p as in Fig. 3(i). The marked cycle $G_{n,k}(src', dst')$ is shown as a cycle where 
a black node denotes a node of B; that is, a node that needs to be covered by some path in $G_{n,k}(src', dst')$ 
(from src' to dst', with $0 = src' \neq dst' = x$ in this illustration). The path p is depicted as a dotted line 
partitioned into composite moves. 
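
The replacements used in Case (a) can be checked by brute force for small exponents. The sketch below assumes the same move semantics as the substitutions of Table 1 (a c-move covers the current node and steps clockwise, an a-move steps anti-clockwise and covers the node reached, a b-move covers the current node, a d-move covers the anti-clockwise neighbour) and uses a cycle long enough that no wrap-around occurs. The helper names `run`, `shorter_replacement` and `verify_case_a` are hypothetical.

```python
def run(moves, k):
    """Start at column 0; return (final column, frozenset of covered nodes)."""
    pos, covered = 0, set()
    for move in moves:
        if move == "c":
            covered.add(pos)
            pos = (pos + 1) % k
        elif move == "a":
            pos = (pos - 1) % k
            covered.add(pos)
        elif move == "b":
            covered.add(pos)
        elif move == "d":
            covered.add((pos - 1) % k)
        else:
            raise ValueError(move)
    return pos, frozenset(covered)


def shorter_replacement(i, j, l):
    """Return (prefix, replacement) as in Case (a), or None when j > i and l <= j - i."""
    if j < i:
        return "c" * i + "b" + "a" * j + "dc", "c" * i + "b" + "a" * (j - 1)
    if j == i:
        return "c" * i + "b" + "a" * j + "dc", "d" + "c" * i + "b" + "a" * (j - 1)
    prefix = "c" * i + "b" + "a" * j + "d" + "c" * l
    if j - i < l <= j:
        return prefix, "a" * (j - i) + "d" + "c" * j + "b" + "a" * (j - l)
    if l > j:
        return prefix, "a" * (j - i) + "d" + "c" * l
    return None  # the surviving two-turn shape: j > i and l <= j - i


def verify_case_a(k=20, bound=5):
    """Check every (i, j, l) with 1 <= i, j, l <= bound; return how many were rewritten."""
    rewritten = 0
    for i in range(1, bound + 1):
        for j in range(1, bound + 1):
            for l in range(1, bound + 1):
                pair = shorter_replacement(i, j, l)
                if pair is None:
                    continue
                old, new = pair
                assert run(old, k) == run(new, k), (i, j, l)   # same end point and coverage
                assert len(new) < len(old), (i, j, l)          # strictly shorter
                rewritten += 1
    return rewritten
```

Only the triples with $j > i$ and $l \leq j - i$ escape a rewrite, which is the shape singled out in the text.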

Case (b): Suppose that p is of form (1) and has a prefix p' of the form $d c^i b a^j d c^l b a$, where $i, j, l \geq 1$. 

If $j \leq i$ then we can replace the prefix $d c^i b a^j d c$ in p' with $d c^i b a^{j-1}$ so as to obtain a contradiction, and if 
$j > i$ then we can delete the first d-move from p to obtain a contradiction. Hence, if p starts with a d-move 
then it has at most 1 turn. 

Case (c): Suppose that p is of form (2) and has a prefix p' of the form $a^i d c^j b a^l d c$, where $i, j, l \geq 1$. 

If $j < i$ then we can replace the prefix $a^i d c^j b a$ in p' with $a^i d c^{j-1}$ so as to obtain a contradiction. If $i = j$ 
then we can replace the prefix $a^i d c^j b a$ in p' with $b a^i d c^{j-1}$ so as to obtain a contradiction. Hence, $j > i$. 

Suppose that $j \geq l > j - i$. We can replace the prefix $a^i d c^j b a^l$ in p with $c^{j-i} b a^j d c^{j-l}$ so as to obtain 
a contradiction. Suppose that $l > j$. We can delete the first occurrence of a d-move in p so as to obtain a 
contradiction. Hence, $l \leq j - i$. Note that if p has 2 turns then p is of the form $a^i d c^j b a^l$ where $j > i$ and 
$l \leq j - i$. Alternatively, suppose that p has at least 3 turns. We can replace the prefix $a^i d c^j b a^l d c$ in p with 
$a^i d c^j b a^{l-1}$ so as to obtain a contradiction. Hence, p has at most 2 turns. 

We can say more if p has 2 turns. Suppose that $j \geq k - 1$. The d-move can be deleted from p' and we 
obtain a contradiction. Hence, if p has 2 turns then p is of the form $a^i d c^j b a^l$ where $k - 1 > j > i \geq 1$ and 
$1 \leq l \leq j - i$. We can visualize p as in Fig. 3(ii). 

Case (d): Suppose that p is of form (2) and has a prefix p' of the form $b a^i d c^j b a^l d c$, where $i, j, l \geq 1$. 

If $j \leq i$ then we can replace the prefix $b a^i d c^j b a$ with $b a^i d c^{j-1}$ so as to obtain a contradiction, and if $j > i$ 
then we can delete the first b-move from p to obtain a contradiction. Hence, if p starts with a b-move then 
it has at most 1 turn. 

So, we have proven the following lemma. 

Lemma 3. If p is a shortest path (from src' to dst') in $G_{n,k}(src', dst')$ then p has at most 2 turns, and if 
p has 2 turns then it must be of the form $c^i b a^j d c^l$ or $a^i d c^j b a^l$, where $k - 1 > j > i \geq 1$ and $1 \leq l \leq j - i$. 

With reference to Fig. 3, the numerical constraints in Lemma 3 mean that there is no interaction or 
overlap involving the 2 turns in p. 


6 An optimal routing algorithm for DPillar 

We now develop an optimal single-path routing algorithm for DPillar, based around Lemma 3. We do this 
by finding a small set $\Pi$ of paths (from src' to dst') in $G_{n,k}(src', dst')$ so that at least one of these paths is 
a shortest path (and consequently we obtain a shortest path in the DCN DPillar$_{n,k}$). By Lemma O we may 
assume that $src = (0, 00 \ldots 0)$ and $dst = (x, v_{k-1} v_{k-2} \ldots v_0)$, and by Lemma 3 we may assume that any 
shortest path has at most 2 turns. 

Our technique is as follows. Essentially, we want to make the set $\Pi$ as small as possible; that is, we want 
our resulting algorithm to have to consider as few paths as possible (when looking for the shortest). Lemma 3 
precisely describes the set of paths we need to consider from the paths involving exactly 2 turns; of course, 
we also need to consider paths involving 1 or 0 turns (if they exist). There are different situations depending 
upon the distribution of the marked nodes needing to be covered; in particular, upon the distribution of 
marked nodes along the natural clockwise and anti-clockwise paths from the source to the destination on 
the marked cycle, assuming the source and destination to be distinct (this is the case in Section 6.1; the case 
when the source and destination are the same is considered in Section 6.2). Sometimes the distribution of 
marked nodes rules out the possibility of certain types of paths. 

6.1 Building our set of paths when $x \neq 0$ 

We first suppose that $0 \neq x$. Let $B = \{i : 0 \leq i \leq k - 1, v_i \neq 0\}$ (that is, the bit-positions that 
need to be ‘fixed’). Suppose that $B \setminus \{0, x\} = \{i_l : 1 \leq l \leq r\} \cup \{j_l : 1 \leq l \leq s\}$ so that we have 
$0 < j_s < j_{s-1} < \ldots < j_1 < x < i_1 < i_2 < \ldots < i_r < k$ (we might have that either r or s is 0, 
when the corresponding set is empty). If $r \geq 2$ then define $\delta_l = i_{l+1} - i_l$, for $l = 1, 2, \ldots, r - 1$, with 
$\delta = \max\{\delta_l : l = 1, 2, \ldots, r - 1\}$; and if $s \geq 2$ then define $\epsilon_l = j_l - j_{l+1}$, for $l = 1, 2, \ldots, s - 1$, with 
$\epsilon = \max\{\epsilon_l : l = 1, 2, \ldots, s - 1\}$. Also: define $\Delta_0 = 1$ (resp. 0), if $0 \in B$ (resp. $0 \notin B$); and $\Delta_x = 1$ (resp. 0), 
if $x \in B$ (resp. $x \notin B$). We can visualize the resulting marked cycle $G_{n,k}(0, x)$ as in Fig. 4(i). Note that in 
this particular illustration $0 \notin B$ and $x \in B$; so, $\Delta_0 = 0$ and $\Delta_x = 1$. Of course, what we are looking for is 
a sequence of (a-, b-, c- and d-)moves that will take us from 0 to x in $G_{n,k}(0, x)$ so that all nodes of B have 
been covered. 

In what follows, we examine different scenarios involving the number r of marked nodes lying strictly 
between x and k, and also the number s of marked nodes lying strictly between 0 and x. Each scenario for r 
contributes certain paths to $\Pi$ as does each scenario for s. 

Note that perhaps the most obvious paths to consider as potential members of $\Pi$ are the paths $c^{k+x}$ and 
$a^{2k-x}$, which have lengths $k + x$ and $2k - x$, respectively. So, we begin by setting $\Pi = \{c^{k+x}, a^{2k-x}\}$. 

From Lemma 3, any shortest path p from 0 to x having 2 turns requires that $r \geq 2$ or $s \geq 2$ and that 
both nodes at which these turns occur are different from 0 and x and lie on the anti-clockwise path from 0 
to x or on the clockwise path from 0 to x, accordingly. Recall also that the node at which any turn occurs 
on a shortest path p is necessarily a marked node (irrespective of the number of turns in p). 

Case (a): Suppose that $r = 0$. 

In this scenario, we contribute either the path $c^x b$ to $\Pi$, if $x \in B$, or the path $c^x$ to $\Pi$, if $x \notin B$; either way, 
the length of the path contributed is $x + \Delta_x$. 

Case (b): Suppose that $s = 0$. 

In this scenario, we contribute either the path $b a^{k-x}$ to $\Pi$, if $0 \in B$, or the path $a^{k-x}$ to $\Pi$, if $0 \notin B$; either 
way, the length of the path contributed is $k - x + \Delta_0$. 

Case (c): Suppose that $r = 1$. 

In this scenario, we contribute 2 paths to $\Pi$. If $x \in B$ then we contribute the path $a^{k-i_1-1} d c^{k-i_1-1+x} b$ to 
$\Pi$, or if $x \notin B$ then we contribute the path $a^{k-i_1-1} d c^{k-i_1-1+x}$ to $\Pi$; either way, the length of the resulting 
path is $2k - 2i_1 + x - 1 + \Delta_x$. We also contribute the path $c^{i_1} b a^{i_1-x}$ to $\Pi$ of length $2i_1 - x + 1$. There is 
potentially another path when $i_1 = x + 1$ and $x \in B$, namely $a^{k-x-1} d c^{k-1}$, but the length of this path is 
$2k - x - 1$ which is greater than $2k - x - 3 + \Delta_x$, which in turn is $2k - 2i_1 + x - 1 + \Delta_x$ evaluated with 
$i_1 = x + 1$. 

Figure 4: Visualizing our notation. 

Case (d): Suppose that $s = 1$. 

In this scenario, we contribute 2 paths to $\Pi$. If $0 \in B$ then we contribute the path $b a^{k-j_1-1} d c^{x-j_1-1}$ to $\Pi$, 
or if $0 \notin B$ then we contribute the path $a^{k-j_1-1} d c^{x-j_1-1}$ to $\Pi$; either way, the length of the resulting path 
is $k - 2j_1 + x - 1 + \Delta_0$. We also contribute the path $c^{j_1} b a^{k+j_1-x}$ to $\Pi$ of length $k + 2j_1 - x + 1$. There is 
potentially another path when $j_1 = 1$ and $0 \in B$, namely $a^{k-1} d c^{x-1}$, but the length of this path is $k + x - 1$ 
which is greater than $k + x - 3 + \Delta_0$, which in turn is $k - 2j_1 + x - 1 + \Delta_0$ evaluated with $j_1 = 1$. 

Case (e): Suppose that $r \geq 2$. 

In this scenario, we contribute $r + 1$ paths to $\Pi$. For each $l \in \{1, 2, \ldots, r - 1\}$, we contribute the 
path $a^{k-i_{l+1}-1} d c^{k-i_{l+1}-1+i_l} b a^{i_l-x}$ to $\Pi$ of length $2k - 2\delta_l - x$. If $x \in B$ then we contribute the path 
$a^{k-i_1-1} d c^{k-i_1-1+x} b$ to $\Pi$, or if $x \notin B$ then we contribute the path $a^{k-i_1-1} d c^{k-i_1-1+x}$ to $\Pi$; either way, the 
length of the path is $2k - 2i_1 + x - 1 + \Delta_x$. We also contribute the path $c^{i_r} b a^{i_r-x}$ to $\Pi$ of length $2i_r - x + 1$. 
(These last 2 paths mirror those constructed in Case (c).) 

Case (f): Suppose that $s \geq 2$. 

In this scenario, we contribute $s + 1$ paths to $\Pi$. For each $l \in \{1, 2, \ldots, s - 1\}$, we contribute the path 
$c^{j_{l+1}} b a^{k+j_{l+1}-j_l-1} d c^{x-j_l-1}$ to $\Pi$ of length $k - 2\epsilon_l + x$. If $0 \in B$ then we contribute the path 
$b a^{k-j_s-1} d c^{x-j_s-1}$ to $\Pi$, or if $0 \notin B$ then we contribute the path $a^{k-j_s-1} d c^{x-j_s-1}$ to $\Pi$; either way, the length 
of the path is $k - 2j_s + x - 1 + \Delta_0$. We also contribute the path $c^{j_1} b a^{k+j_1-x}$ to $\Pi$ of length $k + 2j_1 - x + 1$. 
(These last 2 paths mirror those constructed in Case (d).) 

Thus, our set $\Pi$ of potential shortest paths contains $r + s + 2$ paths (from which at least one is a shortest 
path). 
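
The candidate set for $x \neq 0$ can be written out explicitly and checked. The sketch below is one reading of Cases (a) to (f): several of the path expressions are illegible in the available copy of this section, so the move strings used here are reconstructions chosen to match the stated lengths, and the move semantics are those implied by the substitutions of Table 1. The function names `run`, `candidate_paths` and `verify` are hypothetical.

```python
from itertools import combinations


def run(moves, k):
    """Start at column 0; return (final column, set of covered nodes)."""
    pos, covered = 0, set()
    for move in moves:
        if move == "c":
            covered.add(pos)
            pos = (pos + 1) % k
        elif move == "a":
            pos = (pos - 1) % k
            covered.add(pos)
        elif move == "b":
            covered.add(pos)
        else:  # "d"
            covered.add((pos - 1) % k)
    return pos, covered


def candidate_paths(k, x, B):
    """Return (path, claimed length) pairs for 0 < x < k and marked set B."""
    i = sorted(v for v in B if x < v < k)                  # i_1 < ... < i_r
    j = sorted((v for v in B if 0 < v < x), reverse=True)  # j_1 > ... > j_s
    d0, dx = int(0 in B), int(x in B)
    paths = [("c" * (k + x), k + x), ("a" * (2 * k - x), 2 * k - x)]
    if not i:                                              # r = 0
        paths.append(("c" * x + "b" * dx, x + dx))
    else:                                                  # r >= 1
        for lo, hi in zip(i, i[1:]):                       # two turns, skipping (lo, hi)
            paths.append(("a" * (k - hi - 1) + "d" + "c" * (k - hi - 1 + lo)
                          + "b" + "a" * (lo - x), 2 * k - 2 * (hi - lo) - x))
        paths.append(("a" * (k - i[0] - 1) + "d" + "c" * (k - i[0] - 1 + x) + "b" * dx,
                      2 * k - 2 * i[0] + x - 1 + dx))
        paths.append(("c" * i[-1] + "b" + "a" * (i[-1] - x), 2 * i[-1] - x + 1))
    if not j:                                              # s = 0
        paths.append(("b" * d0 + "a" * (k - x), k - x + d0))
    else:                                                  # s >= 1
        for hi, lo in zip(j, j[1:]):                       # two turns, skipping (lo, hi)
            paths.append(("c" * lo + "b" + "a" * (k + lo - hi - 1) + "d"
                          + "c" * (x - hi - 1), k - 2 * (hi - lo) + x))
        paths.append(("b" * d0 + "a" * (k - j[-1] - 1) + "d" + "c" * (x - j[-1] - 1),
                      k - 2 * j[-1] + x - 1 + d0))
        paths.append(("c" * j[0] + "b" + "a" * (k + j[0] - x), k + 2 * j[0] - x + 1))
    return paths


def verify(k_max=6):
    """Check length, end point and coverage of every candidate for all small instances."""
    checked = 0
    for k in range(3, k_max + 1):
        for x in range(1, k):
            for size in range(k + 1):
                for B in map(set, combinations(range(k), size)):
                    for path, length in candidate_paths(k, x, B):
                        pos, covered = run(path, k)
                        assert len(path) == length and pos == x and B <= covered
                        checked += 1
    return checked
```

This only confirms that every candidate is a valid path of the claimed length; that the set contains a shortest path is the content of Lemma 3 and the case analysis above.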

6.2 Building our set of paths when $x = 0$ 

Now we suppose that $x = 0$. We proceed as we did above and build a set $\Pi$ of potential shortest paths. 
Let $B = \{i : 0 \leq i \leq k - 1, v_i \neq 0\}$. Suppose that $B \setminus \{0\} = \{i_l : 1 \leq l \leq r\}$ so that we have 
$0 < i_1 < i_2 < \ldots < i_r < k$ (we might have that r is 0 when the corresponding set is empty). If $r \geq 2$ then 
define $\delta_l = i_{l+1} - i_l$, for $l = 1, 2, \ldots, r - 1$, with $\delta = \max\{\delta_l : l = 1, 2, \ldots, r - 1\}$. We define $\Delta_0 = 1$, if $0 \in B$, 
and $\Delta_0 = 0$, if $0 \notin B$. We can visualize the resulting marked cycle $G_{n,k}(0, 0)$ as in Fig. 4(ii). Again, the 
most obvious path to consider is $c^k$ (or $a^k$) which has length k. We begin by setting $\Pi = \{c^k\}$. 

Case (a): Suppose that $r = 0$. 

In this scenario, we contribute the path b of length 1 (note that in this case the node 0 is necessarily marked 
as we originally assumed that we started with distinct source and destination servers in the DCN DPillar$_{n,k}$). 

Case (b): Suppose that $r = 1$. 

If $i_1 = k - 1$ then we contribute the path bd, if $0 \in B$, and the path d, if $0 \notin B$; either way, the path has length 
$1 + \Delta_0$. If $1 = i_1 \neq k - 1$ then we contribute the path cba of length 3. If $1 \neq i_1 \neq k - 1$ then we contribute 
2 paths. The first of these paths is the path $b a^{k-i_1-1} d c^{k-i_1-1}$, if $0 \in B$, and the path $a^{k-i_1-1} d c^{k-i_1-1}$, if 
$0 \notin B$; either way, this path has length $2k - 2i_1 - 1 + \Delta_0$. The second of these paths is the path $c^{i_1} b a^{i_1}$ of 
length $2i_1 + 1$. 

Case (c): Suppose that $r \geq 2$. 

In this scenario, we contribute $r + 1$ paths to $\Pi$. For each $l \in \{1, 2, \ldots, r - 1\}$, we contribute the path 
$a^{k-i_{l+1}-1} d c^{k-i_{l+1}-1+i_l} b a^{i_l}$ to $\Pi$ of length $2k - 2\delta_l$. If $0 \in B$ then we contribute the path $b a^{k-i_1-1} d c^{k-i_1-1}$ 
to $\Pi$, or if $0 \notin B$ then we contribute the path $a^{k-i_1-1} d c^{k-i_1-1}$ to $\Pi$; either way, this path has length 
$2k - 2i_1 - 1 + \Delta_0$. We also contribute the path $c^{i_r} b a^{i_r}$ to $\Pi$ of length $2i_r + 1$. (These last 2 paths mirror 
those constructed in Case (b).) 

Thus, our set $\Pi$ of potential shortest paths contains at most $r + 1$ paths (from which at least one is a 
shortest path). 

6.3 Our algorithm 

We now use our set $\Pi$ of potential shortest paths so as to find a shortest path or the length of a shortest 
path. Our algorithm, DPillarMin, for finding the length of a shortest path in $G_{n,k}(0, x)$ is as follows. 

Algorithm: DPillarMin 
    calculate $B$ 
    if $0 \neq x$ then 
        $L = \min\{k + x, 2k - x\}$ 
        calculate $r$, $s$, $\delta$, $\epsilon$, $\Delta_0$ and $\Delta_x$ 
        if $r = 0$ then $L = \min\{L, x + \Delta_x\}$ 
        if $s = 0$ then $L = \min\{L, k - x + \Delta_0\}$ 
        if $r = 1$ then 
            $L = \min\{L, 2k - 2i_1 + x - 1 + \Delta_x, 2i_1 - x + 1\}$ 
        if $s = 1$ then 
            $L = \min\{L, k - 2j_1 + x - 1 + \Delta_0, k + 2j_1 - x + 1\}$ 
        if $r \geq 2$ then 
            calculate $\delta$    % only need consider max. $\delta_l$ 
            $L = \min\{L, 2k - 2\delta - x, 2k - 2i_1 + x - 1 + \Delta_x, 2i_r - x + 1\}$ 
        if $s \geq 2$ then 
            calculate $\epsilon$    % only need consider max. $\epsilon_l$ 
            $L = \min\{L, k - 2\epsilon + x, k - 2j_s + x - 1 + \Delta_0, k + 2j_1 - x + 1\}$ 
    else 
        calculate $r$ and $\delta$ 
        if $r = 0$ then $L = 1$ 
        if $r = 1$ then 
            if $i_1 = k - 1$ then $L = 1 + \Delta_0$ 
            if $1 = i_1 \neq k - 1$ then $L = 3$ 
            if $1 \neq i_1 \neq k - 1$ then 
                $L = \min\{2k - 2i_1 - 1 + \Delta_0, 2i_1 + 1\}$ 
        if $r \geq 2$ then 
            $L = \min\{k, 2k - 2\delta, 2k - 2i_1 - 1 + \Delta_0, 2i_r + 1\}$ 
    output $L$ 
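
For readers who prefer executable code, the length computation can be transcribed almost line for line. The following Python sketch is a direct reading of the pseudocode above and is not the authors' implementation; it takes the number of columns `k`, the destination column `x` and the set `B` of marked columns as inputs, and the function name `dpillar_min_length` is a hypothetical choice. As in the text, it assumes that the source and destination servers are distinct (so that B is non-empty when x = 0).

```python
def dpillar_min_length(k, x, B):
    """Length of a shortest covering path from column 0 to column x on the marked cycle."""
    d0, dx = int(0 in B), int(x in B)                      # Delta_0 and Delta_x
    if x != 0:
        i = sorted(v for v in B if x < v < k)              # i_1 < ... < i_r
        j = sorted((v for v in B if 0 < v < x), reverse=True)  # j_1 > ... > j_s
        r, s = len(i), len(j)
        L = min(k + x, 2 * k - x)
        if r == 0:
            L = min(L, x + dx)
        if s == 0:
            L = min(L, k - x + d0)
        if r == 1:
            L = min(L, 2 * k - 2 * i[0] + x - 1 + dx, 2 * i[0] - x + 1)
        if s == 1:
            L = min(L, k - 2 * j[0] + x - 1 + d0, k + 2 * j[0] - x + 1)
        if r >= 2:
            delta = max(hi - lo for lo, hi in zip(i, i[1:]))   # largest gap delta_l
            L = min(L, 2 * k - 2 * delta - x,
                    2 * k - 2 * i[0] + x - 1 + dx, 2 * i[-1] - x + 1)
        if s >= 2:
            eps = max(hi - lo for hi, lo in zip(j, j[1:]))     # largest gap epsilon_l
            L = min(L, k - 2 * eps + x,
                    k - 2 * j[-1] + x - 1 + d0, k + 2 * j[0] - x + 1)
        return L
    i = sorted(v for v in B if 0 < v < k)
    r = len(i)
    if r == 0:
        return 1                                           # only column 0 is marked
    if r == 1:
        if i[0] == k - 1:
            return 1 + d0
        if i[0] == 1:
            return 3
        return min(2 * k - 2 * i[0] - 1 + d0, 2 * i[0] + 1)
    delta = max(hi - lo for lo, hi in zip(i, i[1:]))
    return min(k, 2 * k - 2 * delta, 2 * k - 2 * i[0] - 1 + d0, 2 * i[-1] + 1)
```

For instance, `dpillar_min_length(5, 1, {4})` corresponds to the earlier example with source $(0, 00 \ldots 0)$ and destination $(1, 10 \ldots 0)$ when k = 5. Sorting is used here only for brevity; a single pass over the k columns suffices to obtain $i_1$, $i_r$, $j_1$, $j_s$, $\delta$ and $\epsilon$.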

If we wish to output a shortest path then all we do is apply the algorithm DPillarMin but remember 
which path corresponds to the final value of L and output this path (note that there may 
be more than one shortest path; exactly which path one obtains depends upon how one implements checking 
the paths of $\Pi$). The time complexity of both algorithms is clearly $O(k)$; that is, linear in the number of 
columns. Henceforth, we assume that the algorithm DPillarMin outputs an actual shortest path. 

It should be clear (using Lemma 3) that the different considerations for r and s exhaust all possibilities 
and that consequently the set of paths $\Pi$ considered by DPillarMin is such as to contain a shortest path. 
Hence, DPillarMin outputs a shortest path from a given source node to a given destination node in 
DPillar$_{n,k}$. In summary, we have the following result. 

Theorem 4. Suppose that $n, k \geq 2$ so that n is even. The algorithm DPillarMin takes as input any two 
servers of DPillar$_{n,k}$, a source and a destination, and outputs a shortest path from the source server to the 
destination server; moreover, it computes this path with time complexity $O(k)$. 

We can confirm that we have undertaken experiments so as to empirically check, using a breadth-first 
search, the correctness of DPillarMin on DPillar$_{n,k}$ when n and k are relatively small. We undertook our 
experiments using our in-house simulator INRFlow [5]. 
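
A breadth-first search of this kind is easy to set up on the marked cycle itself. The sketch below is not the INRFlow code; it is a small stand-alone search over states (current column, set of marked nodes covered so far), using the move semantics implied by the substitutions of Table 1. The function name `bfs_length` is hypothetical.

```python
from collections import deque


def bfs_length(k, x, B):
    """Fewest moves needed to go from column 0 to column x while covering every node of B."""
    index = {node: bit for bit, node in enumerate(sorted(B))}
    full = (1 << len(index)) - 1

    def mark(mask, node):
        # Record that `node` has been covered (unmarked nodes leave the mask unchanged).
        return mask | (1 << index[node]) if node in index else mask

    start = (0, 0)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        pos, mask = state
        if pos == x and mask == full:
            return dist[state]
        prev = (pos - 1) % k
        successors = (
            ((pos + 1) % k, mark(mask, pos)),   # c-move
            (pos, mark(mask, pos)),             # b-move
            (prev, mark(mask, prev)),           # a-move
            (pos, mark(mask, prev)),            # d-move
        )
        for nxt in successors:
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    raise ValueError("destination state unreachable")
```

The state space has at most $k \cdot 2^{|B|}$ elements, so this exhaustive search is only practical for small k; it is intended as a reference against which the values computed by a linear-time routine can be compared.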

6.4 The diameter of DPillar 

We also compute the diameter of the DCN DPillar$_{n,k}$, i.e., the maximum of the lengths of shortest paths 
joining any two distinct servers. All that was stated in [TB] was that the diameter of the DCN DPillar$_{n,k}$ is 
a ‘linear function of k’. 

Theorem 5. If $k \in \{2, 3\}$ then the DCN DPillar$_{n,k}$ has diameter k; and if $k \geq 4$ then the DCN DPillar$_{n,k}$ 
has diameter $k + \lfloor k/2 \rfloor - 2$. 

Proof. Let src and dst be nodes of the digraph DPillar$_{n,k}$. W.l.o.g. we may assume that the column-index 
of src is 0 and that of dst is x. We work in $G_{n,k}(0, x)$ and in the context of the algorithm DPillarMin. 

We first note that for any x, the worst-case scenario is when all nodes of $G_{n,k}(0, x)$ are marked, as a 
shortest path in this scenario yields a path in any other scenario (though not necessarily a shortest one). 
Hence, in what follows we assume that all nodes are marked. 

Case (a): $k \geq 5$. 

We consider first the case when $x \neq 0$. There are 5 different scenarios for $(r, s)$: $(0, \geq 2)$; $(1, \geq 2)$; $(\geq 2, 0)$; 
$(\geq 2, 1)$; and $(\geq 2, \geq 2)$. 

Consider first when $r \geq 2$ and $s \geq 2$. By consideration of the algorithm DPillarMin, where we have 
$\delta = 1$, $i_1 = x + 1$, $i_r = k - 1$, $\epsilon = 1$, $j_s = 1$, $j_1 = x - 1$, $\Delta_0 = 1$ and $\Delta_x = 1$, we immediately see that 
$L = \min\{k + x, 2k - x, 2k - 2 - x, 2k - 2(x + 1) + x, 2(k - 1) - x + 1, k - 2 + x, k - 2 + x, k + 2(x - 1) - x + 1\} = 
\min\{k + x - 2, 2k - x - 2\}$. We are trying to find a value of x that maximizes this minimum value. If 
$k + x - 2 \geq 2k - x - 2$ then $x \geq k/2$; so, in this situation this minimum value is maximized when $x = \lceil k/2 \rceil$ and 
this minimum value is then $2k - \lceil k/2 \rceil - 2 = k + \lfloor k/2 \rfloor - 2$. If $k + x - 2 \leq 2k - x - 2$ then $x \leq k/2$; so, in this 
situation this minimum value is maximized when $x = \lfloor k/2 \rfloor$ and this minimum value is then $k + \lfloor k/2 \rfloor - 2$. 

In each of the other 4 cases for $(r, s)$, where $x \in \{1, 2, k - 2, k - 1\}$, we have that the length of the 
path produced by DPillarMin is trivially less than $k + \lfloor k/2 \rfloor - 2$ (simply look at the initial minimization 
$L = \min\{k + x, 2k - x\}$). Also, when $x = 0$ the length of the path produced is trivially less than $k + \lfloor k/2 \rfloor - 2$. 

Hence, when $k \geq 5$ the diameter is $k + \lfloor k/2 \rfloor - 2$. 

Case (b): $2 \leq k \leq 4$. 

It is trivial to see by hand that the diameter in this case is k. The result follows. □ 
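
The maximisation in the proof can be replayed numerically. The sketch below assumes that every node is marked, in which case $r = k - 1 - x$ and $s = x - 1$, so the scenario $r \geq 2$ and $s \geq 2$ corresponds to $3 \leq x \leq k - 3$ and requires $k \geq 6$; this range, and the function names `diameter` and `maximised_minimum`, are assumptions made for the illustration.

```python
def diameter(k):
    """Diameter of DPillar with k columns, as stated in Theorem 5."""
    if k < 2:
        raise ValueError("k must be at least 2")
    return k if k in (2, 3) else k + k // 2 - 2


def maximised_minimum(k):
    """max over x of min{k + x - 2, 2k - x - 2} for 3 <= x <= k - 3 (needs k >= 6)."""
    if k < 6:
        raise ValueError("the scenario r >= 2 and s >= 2 needs k >= 6")
    return max(min(k + x - 2, 2 * k - x - 2) for x in range(3, k - 2))


if __name__ == "__main__":
    for k in range(6, 12):
        print(k, maximised_minimum(k), diameter(k))
```

For each k in the printed range the two values agree, and the maximum is attained at $x = \lfloor k/2 \rfloor$.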

7 Experimental work 

Whilst we have obtained an optimal single-path routing algorithm for DPillar (optimal in that our algorithm 
always outputs a shortest path), as yet we have no idea as to how often the single-path routing algorithm 
DPillarSP is sub-optimal and the savings to be made by employing our optimal algorithm. To undertake 
a precise analytical evaluation of this question would be challenging; consequently, we proceed to evaluate 
empirically the most important performance metrics, namely path length, aggregate bottleneck throughput 
and transmission latency. 

We undertake our evaluation using simulation. We use our own flow-based framework INRFlow [8]. 
The reason we adopt a simulation-based evaluation is as follows. Future DCNs are intended to incorporate 
hundreds of thousands, if not millions, of processors. Consequently, building a test-bed of servers (bearing 
in mind realistic access to resources) would only yield a DCN with a handful of servers and there would 
be no grounds for believing that any such evaluation would scale up. For instance, even to build 
the smallest meaningful DPillar$_{n,k}$ would require that n should be at least 6 and k at least 3, which would 
result in a test-bed with 81 servers, which is beyond our means. Not surprisingly, simulation is the standard 
evaluation mechanism in the literature. Of the DCNs mentioned in this paper, FiConn, MCube, HCN, 
BCN, SWKautz, SWCube, and SWdBruijn were all evaluated using simulation with only DCell, BCube, 
and CamCube evaluated using test-beds, incorporating 20, 16 and 27 servers, respectively. In addition, 
the aspects of symmetry present in DCNs ameliorate the likelihood of ‘random’ aspects of the network 
topology having an unexpected impact upon performance when compared with more unstructured networks. 
Finally, as regards our evaluation of communication latency in Section 7.3, we have incorporated realistic 
measurements of protocol stack, propagation, data transmission, and routing latencies into our analysis. 
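
The figure of 81 servers can be reproduced under the assumption that DPillar$_{n,k}$ has k columns of $(n/2)^k$ servers each, i.e., $k (n/2)^k$ servers in total. That formula is not restated in this section, so it is an assumption here; the function name `num_servers` is hypothetical.

```python
def num_servers(n, k):
    """Number of servers in DPillar with n-port switches and k columns (assumed k * (n/2)**k)."""
    if n % 2 != 0 or n < 2 or k < 2:
        raise ValueError("n must be even and n, k must be at least 2")
    return k * (n // 2) ** k


if __name__ == "__main__":
    print(num_servers(6, 3))     # smallest configuration mentioned in the text
    print(num_servers(16, 5))    # largest n = 16 configuration evaluated
```

The same formula also reproduces the server counts listed for the configurations evaluated in the experiments of this section.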

7.1 Path Length 

In order to obtain some idea of the practical significance of our algorithm DPillarMin in terms of path length, 
we undertook the following experiment. For specific values of n and k, we measured the average length of 
the paths obtained by employing both DPillarMin and DPillarSP for every possible source-destination pair 
(node-symmetry means that we can actually fix a unique source node) as well as the cumulative frequencies 
of the lengths of paths arising. We also measured the number of such occasions when the path derived by 
DPillarSP is longer than the path derived by DPillarMin; that is, the number of times DPillarSP produced 
a non-minimal path. Our results are shown in Table 2 and Table 3. In Table 2, the columns denote (in 
order): the parameters n and k of the particular DPillar$_{n,k}$ that we are working with; the number of servers 
in that DPillar$_{n,k}$; the average path lengths obtained from inputting every possible source-destination pair 
to the algorithms DPillarMin and DPillarSP; the improvement in terms of average path length obtained by 
employing DPillarMin as a percentage of the average path length obtained by employing DPillarSP; and 
the percentage of source-destination pairs where the optimal path length is shorter than that obtained by 
employing DPillarSP. In Table 3, for each chosen n and k we show the cumulative frequencies of the lengths 
of paths obtained by employing the two algorithms DPillarSP and DPillarMin. These cumulative frequencies 
are shown as percentages of the total number of pairs of (not necessarily distinct) servers and are rounded 
to the nearest 0.1% (in order to save space we do not show data relating to all pairs of n and k; this omitted 
data is as might be expected). 

As can be seen from Table 2, using the algorithm DPillarMin yields a very significant improvement of 
between 25% and 30% in terms of the average path length. It is also worth highlighting that the proportion 
of non-optimal paths generated by DPillarSP is between 66% and 78% and increases significantly with k. 

Table 2: Average path lengths: DPillarMin vs. DPillarSP. 

  n    k    # of servers    av. path length    av. path length    av. length     non-minimal
                            (DPillarMin)       (DPillarSP)        improvement    paths
 16    3          1,536          2.72               3.86              29%            66%
 16    4         16,384          3.74               5.36              30%            73%
 16    5        163,840          4.77               6.86              30%            78%
 32    3         12,288          2.86               3.93              27%            67%
 32    4        262,144          3.87               5.43              28%            74%
 48    3         41,472          2.9                3.96              26%            67%
 64    3         98,304          2.93               3.97              26%            67%
 80    3        192,000          2.94               3.97              25%            67%
128    3        786,432          2.96               3.98              25%            67%

Table 3: Cumulative frequencies of path lengths: DPillarMin vs. DPillarSP. 

                     cumulative % of pairs with path length at most
  n    k   alg.      0      1      2      3      4      5      6      7      8      9
 16    3   SP       0.1    0.6    4.8   38.0   70.8    100     -      -      -      -
 16    3   Min      0.1    2.0   26.2    100     -      -      -      -      -      -
 16    5   SP       0.0    0.0    0.0    0.4    2.9   22.9   42.9   62.8   82.5    100
 16    5   Min      0.0    0.0    0.3    2.5   20.3    100     -      -      -      -
 32    4   SP       0.0    0.0    0.1    1.7   26.7   51.7   76.6    100     -      -
 32    4   Min      0.0    0.0    0.7   12.0    100     -      -      -      -      -
 80    3   SP       0.0    0.0    0.0   33.3   67.4    100     -      -      -      -
 80    3   Min      0.0    0.1    5.7    100     -      -      -      -      -      -
128    3   SP       0.0    0.0    0.5   33.9   67.2    100     -      -      -      -
128    3   Min      0.0    0.0    3.6    100     -      -      -      -      -      -

Note that a reduction in path length does not only mean that the latency experienced by network traffic 
should be reduced (more on this later) but also that each flow will require less aggregate bandwidth to be 
transmitted and so the overall throughput of the network should also increase. As can be seen from Table 3, 
in each of the chosen scenarios DPillarMin yields significant cumulative improvements in path length. For 
example, when n = 16 and k = 5, with DPillarSP only 22.9% of all paths have length at most 5 whereas with 
DPillarMin all paths do. We measure next the aggregate bottleneck throughput obtained through using the 
two different routing algorithms. 

7.2 Aggregate Bottleneck Throughput 

The aggregate bottleneck throughput, or simply ABT, is a metric introduced in [13] in order to estimate 
the throughput performance of a DCN. The reasoning behind ABT is that the performance of an all-to-all 
operation is limited by its slowest flow, i.e., the flow with the lowest throughput. The ABT is defined as the 
total number of flows times the throughput of the bottleneck flow, i.e., a flow crossing the link that sustains 
the most flows. In our experiments the bottleneck flow is determined experimentally using actual routing 
functions within our framework INRFlow. We assume an all-to-all communication pattern, so that there are 
$N(N - 1)$ flows, and a bandwidth of 1 unit per directional link, where N is the total number of servers. Since 
datacenters are most commonly used as stream processing platforms and are therefore bandwidth limited, 
this is an extremely relevant performance metric. 

Table 4 shows that DPillarMin is capable of offering much higher ABT than DPillarSP in all cases, with 
improvements of between 74% and 122% (the right-most column is the improvement in ABT obtained by using 
DPillarMin rather than DPillarSP, divided by the ABT of DPillarSP). Informally, this means that bandwidth-limited 
applications such as, for example, Big Data analytics, running over a DCN using DPillarMin might be able to 
achieve nearly twice as much computational throughput as the same application running over a DCN using 
DPillarSP. This can provide significant savings in terms of running and maintenance costs associated with 
each application and thus could result in more competitive pricing for tenants. Furthermore, as applications 
run faster it will be possible to run more applications in a given time frame and so there is considerable 
potential for increasing the overall profit of the datacenter. 

Table 4: Aggregate bottleneck throughput: DPillarMin vs. DPillarSP. 

  n    k    # of servers    ABT (DPillarMin)    ABT (DPillarSP)    ABT improvement
 16    3          1,536            757.16             397.93              90%
 16    4         16,384           6077.88            3056.72              99%
 16    5        163,840          52953.26           23883.38             122%
 32    3         12,288           5651.85            3126.72              81%
 32    4        262,144          92102.69           48276.98              91%
 48    3         41,472          18634.09           10472.73              78%
 64    3         98,304          43653.56           24761.71              76%
 80    3        192,000          84659.97           48362.72              75%
128    3        786,432         343097.99          197595.98              74%
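
The ABT definition and the improvement column of Table 4 can be expressed in a few lines. The sketch below assumes that the flows crossing the most heavily loaded link share its bandwidth equally, so that the bottleneck throughput is the link bandwidth divided by the largest number of flows on any link; this fair-sharing reading, and the names `abt`, `improvement_percent` and `TABLE_4`, are assumptions made for illustration.

```python
def abt(num_servers, max_flows_on_a_link, link_bandwidth=1.0):
    """ABT for an all-to-all pattern: number of flows times bottleneck-flow throughput."""
    flows = num_servers * (num_servers - 1)
    bottleneck_throughput = link_bandwidth / max_flows_on_a_link
    return flows * bottleneck_throughput


def improvement_percent(abt_min, abt_sp):
    """Gain of DPillarMin over DPillarSP, relative to the ABT of DPillarSP."""
    return 100.0 * (abt_min - abt_sp) / abt_sp


# (n, k) -> (ABT with DPillarMin, ABT with DPillarSP), copied from Table 4.
TABLE_4 = {
    (16, 3): (757.16, 397.93),
    (16, 4): (6077.88, 3056.72),
    (16, 5): (52953.26, 23883.38),
    (32, 3): (5651.85, 3126.72),
    (32, 4): (92102.69, 48276.98),
    (48, 3): (18634.09, 10472.73),
    (64, 3): (43653.56, 24761.71),
    (80, 3): (84659.97, 48362.72),
    (128, 3): (343097.99, 197595.98),
}

if __name__ == "__main__":
    for (n, k), (abt_min, abt_sp) in TABLE_4.items():
        print(n, k, round(improvement_percent(abt_min, abt_sp)))
```

Rounding each computed gain to the nearest integer reproduces the improvement column of Table 4.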

7.3 Communication Latency 

Not all datacenter applications are bandwidth sensitive; indeed, many of them are more sensitive to latency, 
such as real-time operations or, more generally, any application interfacing with users. For this reason, it 
is important that we look at the transmission latency that we can expect from DPillarSP and DPillarMin. 
As there is no server-centric DCN framework available that will enable us to perform testbench experiments 
(building one ourselves is not possible), we measure the latencies imposed by the different steps of the 
transmission, namely within the protocol stack, propagation latency, data transmission latency and routing 
at the servers, so as to obtain an estimate of how changing the routing algorithm would affect the overall 
performance. Our experiments were as follows. 

• We measured the round trip time of both an empty frame (28 bytes for the headers) and a full frame
(1,500 bytes, including the headers) sent to localhost so as to measure the latency imposed by going
up and down the protocol stack. In both cases, the stack latency, $L_s$, was found to be 10 µs.

• We measured the round trip time of an empty frame sent to a neighbouring server connected to the same
Gigabit Ethernet switch. This was found to be 64 µs; thus we can compute the one-way propagation
latency, $L_p$, i.e., the time to go through the links and the switch, by dividing by two and subtracting the
stack latency. This yields a propagation latency of 22 µs.

• We measured the round trip time of a full frame sent to the same neighbouring server. This was found
to be 140 µs; thus the one-way data transfer latency, $L_d$, can be calculated similarly by dividing by two
and subtracting the stack latency as well as the propagation latency. This results in a data transfer
latency of 38 µs (roughly 26 ns per byte).

• We measured the average routing latency, $L_r$, for both algorithms for a selection of the configurations
above (those with between 8 and 40 thousand servers). Note that the code for the two algorithms is
not optimised and that it includes some overheads imposed by our framework; so the running times
for the algorithms can be considered as a worst case.
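
The arithmetic behind the first three measurements can be sketched as follows. This is only an illustration of the subtraction steps described above, not the measurement code itself; the function name `one_way_latencies` and its parameter names are hypothetical, and all values are in microseconds.

```python
def one_way_latencies(stack_us, rtt_empty_us, rtt_full_us):
    """Split ping round-trip times (in microseconds) into one-way components.

    stack_us     -- stack latency measured on localhost
    rtt_empty_us -- round-trip time of an empty frame to a neighbouring server
    rtt_full_us  -- round-trip time of a full frame to the same server
    """
    # One-way time of an empty frame = stack latency + propagation latency.
    propagation_us = rtt_empty_us / 2 - stack_us
    # One-way time of a full frame additionally includes the data transfer latency.
    transfer_us = rtt_full_us / 2 - stack_us - propagation_us
    return propagation_us, transfer_us


propagation_us, transfer_us = one_way_latencies(10, 64, 140)
print(propagation_us, transfer_us)  # 22.0 38.0
```

With the measured values (10, 64 and 140 microseconds) the function returns the propagation and data transfer latencies quoted in the list.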

Consequently, for both DPillarSP and DPillarMin we obtain the per-hop latency $L_{hop} = L_s + L_p + L_d + L_r$
along with the server-to-server latency $L_{total} = L_{hop} \times d$, where $d$ is the average path length.

All the measurements were carried out under low load conditions on the same server: a 32-core AMD
Opteron 6220 with 256 Gbytes of RAM and running an Ubuntu 14.04.1 SMP OS. Round-trip time
measurements were carried out with the ping utility. The server and its neighbour are located within the same
rack and are connected with short (at most 1 metre) electrical wires to a 24-port 1-Gbit Ethernet switch
which does not support jumbo frames (we do not have 10-Gbit Ethernet hardware available). Note that the
use of short wires is the best case for the propagation delay, as in a real scale-out datacenter wires will be
much longer and so propagation delays will be larger (even if fibre connections are used [21]). Similarly, the
measured latency of the protocol stack does not take into account any extra management/control inherent
to the server-centric nature of the system; so, again, it can be considered a best-case scenario. Increasing
these delays will dilute the effects of the average routing latency in the total latency even more than in our
preliminary estimate. (We remind the reader that a DPillar datacenter would be constructed out of COTS
hardware and so our experimental set-up is reasonable.)


Table 5: Average routing latencies: DPillarMin vs. DPillarSP.

DPillar$_{n,k}$   # of servers   $L_r$ (µs)
 n    k                          DPillarMin   DPillarSP   increase
 16   4           16,384         5.964        1.349       442%
 32   3           12,288         3.325        0.960       346%
 48   3           41,472         3.328        0.859       387%


Table 6: Per-hop and overall latencies: DPillarMin vs. DPillarSP.

DPillar$_{n,k}$   $L_{hop}$ (µs)                     $L_{total}$ (µs)
 n    k           DPillarMin   DPillarSP   decl.     DPillarMin   DPillarSP   improve.
 16   4           76.0         71.3        6%        284.1        382.2       26%
 32   3           73.3         71.0        3%        209.5        279.1       25%
 48   3           73.3         70.9        3%        212.9        280.4       24%


Table 5 shows the average routing latency $L_r$ for DPillarSP and DPillarMin, along with the increase in
the average routing latency when using DPillarMin as opposed to DPillarSP (shown as a percentage of the
average routing latency when using DPillarSP). Table 6 details the per-hop and server-to-server latencies for
both DPillarSP and DPillarMin. The very slight increase in the per-hop latency when using DPillarMin as
opposed to DPillarSP is shown, as is the improvement in the server-to-server latency when using DPillarMin
as opposed to DPillarSP (both are shown as a percentage of the corresponding value for DPillarSP).

It can be seen that the average routing latency for DPillarMin is between 3.4 and 4.5 times that for
DPillarSP, but is of the order of only a few microseconds, which is well below the other latencies
measured in our experimental set-up. In consequence, the per-hop latencies of DPillarSP and DPillarMin
are very similar; however, there is a significant reduction in server-to-server latency for DPillarMin over
DPillarSP (between 24% and 26%) when the reductions in average path length are factored in.
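
As a rough check of how these figures fit together, the sketch below recomputes the per-hop latencies from the measured components (10, 22 and 38 microseconds) and the routing latencies of Table 5, and then derives the percentages of Table 6. The dictionary layout and the function names are hypothetical. The average path length is not an input here; it is only implied as the ratio of server-to-server latency to per-hop latency.

```python
# Measured components in microseconds: stack, propagation and data transfer.
L_S, L_P, L_D = 10.0, 22.0, 38.0

# (n, k) -> (routing latency DPillarMin, routing latency DPillarSP), from Table 5.
ROUTING_US = {(16, 4): (5.964, 1.349), (32, 3): (3.325, 0.960), (48, 3): (3.328, 0.859)}
# (n, k) -> (total latency DPillarMin, total latency DPillarSP), from Table 6.
TOTAL_US = {(16, 4): (284.1, 382.2), (32, 3): (209.5, 279.1), (48, 3): (212.9, 280.4)}


def per_hop(routing_us):
    """Per-hop latency: the three fixed components plus the routing latency."""
    return L_S + L_P + L_D + routing_us


def summarise(config):
    r_min, r_sp = ROUTING_US[config]
    t_min, t_sp = TOTAL_US[config]
    hop_min, hop_sp = per_hop(r_min), per_hop(r_sp)
    return {
        "hop_min": hop_min,
        "hop_sp": hop_sp,
        # Both percentages are relative to the DPillarSP value.
        "hop_increase_pct": 100 * (hop_min - hop_sp) / hop_sp,
        "total_improvement_pct": 100 * (t_sp - t_min) / t_sp,
        # Average path lengths implied by L_total = L_hop x d.
        "implied_d_min": t_min / hop_min,
        "implied_d_sp": t_sp / hop_sp,
    }


for config in ROUTING_US:
    summary = summarise(config)
    print(config, {key: round(value, 2) for key, value in summary.items()})
```

The fixed components contribute 70 microseconds per hop for both algorithms, which is why a routing latency that is several times larger changes the per-hop latency by only a few percent.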

Informal analytical modelling using the values above as a reference suggests that if jumbo frames were used
then there would be a negligible increase in per-hop latency, yielding a significant overall improvement in
server-to-server latency of up to 30%. A similar analysis using 10-Gbit Ethernet hardware suggests
that while per-hop latency can increase by up to 12.5% when using DPillarMin as opposed to DPillarSP, the
overall server-to-server latency improvement will still be in the range 19-23%. Finally, the estimates with
jumbo frame-enabled 10-Gbit Ethernet yield results very similar to the ones presented here. Full details are
available in the supplemental material.

While the latency analysis performed here is rather simplistic and only covers zero-load latencies, our 
objective is not to provide highly accurate latency figures but to show that the impact of the routing 
algorithm DPillarMin on latency is insignificant, particularly when compared with the huge gains in terms 
of path length and throughput. Note that due to its less favourable throughput, the use of DPillarSP would 
lead to additional queuing in the servers, which would in turn have a detrimental impact upon performance.

7.4 FiConn and DCell


There does not exist a proper comparative experimental evaluation of the numerous (dual-port) server-centric
DCNs in the literature; comparative evaluations that have been undertaken so far are somewhat ad
hoc, both in terms of the DCNs compared and the performance metrics evaluated. Of course, an extensive 
comparative evaluation will be a significant body of work and is well beyond the scope of this paper (where 
our focus has been on improving routing in DPillar). Moreover, we are fully aware that there are many 
different metrics for DCN evaluation, such as those relating to fault-tolerance, bisection bandwidth, load 
balancing, latency, throughput, scalability, and so on, and that the eventual success of a DCN will usually 
depend on its capacity to cope well across a range of such metrics. Nevertheless, we end our experimentation 
by including an interesting prelude to a fuller analysis of routing within server-centric DCNs: we briefly 
compare routing in DPillar with routing in the two DCNs DCell and FiConn. 

We have chosen FiConn and DCell as they are widely regarded as benchmark server-centric DCNs. Like 
DPillar, FiConn is dual-port, whereas DCell is such that the number of server NIC ports is variable. The 
reader is referred to [16] and [12] for definitions of FiConn and DCell, respectively, but just as with DPillar, 
FiConn and DCell are families of DCNs parameterized by $n$, the number of switch-ports in a switch, and $k$,
the depth of the recursive construction (actually, a server in DCell$_{n,k}$ has $k + 1$ NIC ports).

In Table 7 we have displayed the average path length and the ABT of (various instantiations of) DPillar
with the routing algorithm DPillarMin, FiConn with the routing algorithm TOR (from [16]), and DCell with
the routing algorithm DCellRouting (from [12]); we have chosen these instantiations so that the different
DCNs can be compared on three different bases, namely them all having roughly 24K, 117K and 170K
servers, respectively (so, we have normalized against the number of servers). As usual, the data in Table 7
has been derived using our tool INRFlow.
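
To see how the instantiations line up, the sketch below recomputes some of the server counts listed in Table 7. The three formulas are assumptions taken from the standard definitions of these networks and are not restated in this section: DPillar$_{n,k}$ has $k(n/2)^k$ servers, DCell follows $t_0 = n$ and $t_j = t_{j-1}(t_{j-1} + 1)$, and FiConn follows $N_0 = n$ and $N_j = N_{j-1}(N_{j-1}/2^j + 1)$. The function names are hypothetical.

```python
def dpillar_servers(n, k):
    """Servers in DPillar_{n,k}: k columns of (n/2)^k servers each."""
    return k * (n // 2) ** k


def dcell_servers(n, k):
    """Servers in DCell_{n,k}: t_0 = n and t_j = t_{j-1} * (t_{j-1} + 1)."""
    size = n
    for _ in range(k):
        size = size * (size + 1)
    return size


def ficonn_servers(n, k):
    """Servers in FiConn_{n,k}: N_0 = n and N_j = N_{j-1} * (N_{j-1} / 2^j + 1)."""
    size = n
    for level in range(1, k + 1):
        size = size * (size // 2 ** level + 1)
    return size


# One row per comparison basis: DPillar, FiConn, DCell.
print(dpillar_servers(18, 4), ficonn_servers(24, 2), dcell_servers(12, 2))
print(dpillar_servers(26, 4), ficonn_servers(36, 2), dcell_servers(18, 2))
print(dpillar_servers(16, 5), ficonn_servers(40, 2), dcell_servers(4, 3))
```

Under these assumed formulas the printed counts coincide with the corresponding "# of servers" entries of Table 7.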


Table 7: Average path lengths and ABT: DPillarMin vs. FiConn vs. DCell.

DPillar$_{n,k}$   # of       av. pth. len.   ABT
 n    k           servers    DPillarMin      DPillarMin
 12   5           38,880     4.68            12805.63
 16   5           163,840    4.77            52952.94
 18   4           26,244     3.77            9616.46
 26   4           114,244    3.84            40637.47
 48   3           41,472     2.90            18633.64
 64   3           98,304     2.93            43653.12

FiConn$_{n,k}$    # of       av. pth. len.   ABT
 n    k           servers    TOR             TOR
 10   3           116,160    12.97           13026.18
 24   2           24,648     6.56            5005.47
 36   2           117,648    6.71            23694.75
 40   2           177,240    6.74            35650.59

DCell$_{n,k}$     # of       av. pth. len.   ABT
 n    k           servers    DCellRouting    DCellRouting
 3    3           24,492     10.18           5475.43
 4    3           176,820    11.29           33582.97
 12   2           24,492     6.34            6968.73
 18   2           117,306    6.56            31937.10


As can readily be seen, DPillar compares extremely well with FiConn and DCell in terms of both average 
path length and ABT (even though DCell would appear to have a natural advantage over the other two 
DCNs as it involves servers with more than two NIC ports). 

We end with some comments on our very brief comparative evaluation. First, we reiterate that what
is really required is an extensive evaluation involving a range of server-centric DCNs across a range of
performance metrics. Second, we observe (in comparison with data in Tables [5] and 0]) that the improvements
made in using DPillarMin in the DCN DPillar, rather than DPillarSP, have resulted in moving DPillar from
being only comparable with DCell and FiConn to being better than DCell and FiConn (at least in terms of
average path length and ABT). Third, there is no reason why a closer combinatorial scrutiny of both DCell and FiConn
might not result in new and better routing algorithms than DCellRouting and TOR, respectively (just as
we have improved routing within DPillar within this paper).


8 Conclusions 

In this paper we have: developed an optimal and practical single-path routing algorithm DPillarMin for 
the DCN DPillar; shown that DPillar is a Cayley graph, and so node-symmetric; and provided an exact 
formulation of the diameter of DPillar. Our experimental results show not only that DPillarMin can significantly
reduce the average path length of network traffic (up to 30%), but also that this reduction results in
a significant increase (more than 2x) in terms of overall network throughput. Finally we showed that the 
computational overhead of DPillarMin is negligible and will barely affect the processing of network traffic: 
less than a 6% increase in per-hop latency, which is more than compensated by the reductions in path length. 

In summary, we can claim that our proposed routing algorithm can deliver massively improved performance
for the DPillar DCN. Furthermore, we feel that there are other areas where efficiency gains might be
made; in particular, with regard to multi-path routing. Of course, we reaffirm our statement above that what
also needs to be undertaken is a holistic comparison of different (dual-port) server-centric DCNs, with their
different routing algorithms and across a wide range of performance metrics, along with the combinatorial
study of DCNs different to DPillar with a view to improving their routing algorithms.

Appendix

A.1 Proof of Lemma 1

Lemma 6. The digraph DPillar$_{n,k}$ is a Cayley graph.

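Proof. We first define a set of elements and then a notion of multiplication on this set. Let $t_0, t_1, \ldots, t_{k-1}$
be distinct symbols and for each $i \in \{0, 1, \ldots, k-1\}$, define

$$G_i = \{ t_i^{p_i} t_{i+1}^{p_{i+1}} \cdots t_{k-1}^{p_{k-1}} t_0^{p_0} t_1^{p_1} \cdots t_{i-1}^{p_{i-1}} : 0 \le p_j \le \tfrac{n}{2} - 1, \text{ for all } j \in \{0, 1, \ldots, k-1\} \}.$$

Define $G_n^k = \bigcup_{i=0}^{k-1} G_i$; so, note that $|G_n^k| = k \left(\tfrac{n}{2}\right)^k$. Define

$$S_a = \{ t_{k-1}^{q} t_0^{0} t_1^{0} \cdots t_{k-2}^{0} : 0 \le q \le \tfrac{n}{2} - 1 \};$$

$$S_b = \{ t_0^{q} t_1^{0} \cdots t_{k-1}^{0} : 0 \le q \le \tfrac{n}{2} - 1 \};$$

$$S_c = \{ t_1^{0} t_2^{0} \cdots t_{k-1}^{0} t_0^{q} : 0 \le q \le \tfrac{n}{2} - 1 \};$$

$$S_d = \{ t_0^{0} t_1^{0} \cdots t_{k-2}^{0} t_{k-1}^{q} : 0 \le q \le \tfrac{n}{2} - 1 \}.$$

Define a right-multiplication $\circ$ of the elements of $G_n^k$ by the elements of $S = S_a \cup S_b \cup S_c \cup S_d$ as follows:

$$t_i^{p_i} t_{i+1}^{p_{i+1}} \cdots t_{k-1}^{p_{k-1}} t_0^{p_0} t_1^{p_1} \cdots t_{i-1}^{p_{i-1}} \circ t_{k-1}^{q} t_0^{0} t_1^{0} \cdots t_{k-2}^{0} = t_{i-1}^{p_{i-1}+q} t_i^{p_i} t_{i+1}^{p_{i+1}} \cdots t_{k-1}^{p_{k-1}} t_0^{p_0} t_1^{p_1} \cdots t_{i-2}^{p_{i-2}}$$

$$t_i^{p_i} t_{i+1}^{p_{i+1}} \cdots t_{k-1}^{p_{k-1}} t_0^{p_0} t_1^{p_1} \cdots t_{i-1}^{p_{i-1}} \circ t_0^{q} t_1^{0} \cdots t_{k-1}^{0} = t_i^{p_i+q} t_{i+1}^{p_{i+1}} \cdots t_{k-1}^{p_{k-1}} t_0^{p_0} t_1^{p_1} \cdots t_{i-1}^{p_{i-1}}$$

$$t_i^{p_i} t_{i+1}^{p_{i+1}} \cdots t_{k-1}^{p_{k-1}} t_0^{p_0} t_1^{p_1} \cdots t_{i-1}^{p_{i-1}} \circ t_1^{0} t_2^{0} \cdots t_{k-1}^{0} t_0^{q} = t_{i+1}^{p_{i+1}} \cdots t_{k-1}^{p_{k-1}} t_0^{p_0} t_1^{p_1} \cdots t_{i-1}^{p_{i-1}} t_i^{p_i+q}$$

$$t_i^{p_i} t_{i+1}^{p_{i+1}} \cdots t_{k-1}^{p_{k-1}} t_0^{p_0} t_1^{p_1} \cdots t_{i-1}^{p_{i-1}} \circ t_0^{0} t_1^{0} \cdots t_{k-2}^{0} t_{k-1}^{q} = t_i^{p_i} t_{i+1}^{p_{i+1}} \cdots t_{k-1}^{p_{k-1}} t_0^{p_0} t_1^{p_1} \cdots t_{i-2}^{p_{i-2}} t_{i-1}^{p_{i-1}+q}$$

with $i$, $p_0, p_1, \ldots, p_{k-1}$ and $q$ as appropriate and with addition on superscripts modulo $\tfrac{n}{2}$.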

We now extend the multiplication we have just defined so that we make $G_n^k$ into a group with a generating
set $S_0$ that is a subset of $S$. Let $\langle S \rangle$ be the set of all elements generated by right-multiplication by elements
of $S$. It is trivial to show that this set is $G_n^k$; that is,

$$G_n^k = \{ ((\cdots((s_1 \circ s_2) \circ s_3) \cdots) \circ s_i) : i \ge 1,\ s_j \in S \text{ for } j = 1, 2, \ldots, i \}.$$

Extend the multiplication $\circ$ to $G_n^k$ by defining that no matter how a multiplication of elements of $S$ is
bracketed, e.g., as $(s_1 \circ (s_2 \circ s_3)) \circ (s_4 \circ s_5)$, the product is defined as that obtained by multiplying on the
right, e.g., as $((((s_1 \circ s_2) \circ s_3) \circ s_4) \circ s_5)$. Consequently, we have now equipped $G_n^k$ with an associative
multiplication $\circ$. It is trivial to check that there is an identity in $G_n^k$ (w.r.t. $\circ$; it is $t_0^0 t_1^0 \cdots t_{k-1}^0$) and also
that every element of $G_n^k$ has an inverse; furthermore, every element of $S$ has an inverse in $S$. Hence, $G_n^k$
is a group generated by the $2n - 2$ elements of $S_0 = S \setminus \{ t_0^0 t_1^0 \cdots t_{k-1}^0 \}$ and $S_0$ is closed under inverses. Let
$G_n^k(S_0)$ be the Cayley graph of $G_n^k$ w.r.t. the generating set $S_0$.

Finally, we prove that the Cayley graph $G_n^k(S_0)$ is exactly the same as DPillar$_{n,k}$. In what follows, by
DPillar$_{n,k}$ we mean the digraph DPillar$_{n,k}$. Define the mapping $\varphi$ from the nodes of $G_n^k(S_0)$ to the nodes of
DPillar$_{n,k}$ by

$$\varphi : t_i^{p_i} t_{i+1}^{p_{i+1}} \cdots t_{k-1}^{p_{k-1}} t_0^{p_0} t_1^{p_1} \cdots t_{i-1}^{p_{i-1}} \mapsto (i, p_{k-1} p_{k-2} \cdots p_0)$$

where $i$, $p_0, p_1, \ldots, p_{k-1}$ are as appropriate. As

$$t_i^{p_i} t_{i+1}^{p_{i+1}} \cdots t_{k-1}^{p_{k-1}} t_0^{p_0} t_1^{p_1} \cdots t_{i-1}^{p_{i-1}} \circ t_{k-1}^{q} t_0^{0} t_1^{0} \cdots t_{k-2}^{0} = t_{i-1}^{p_{i-1}+q} t_i^{p_i} t_{i+1}^{p_{i+1}} \cdots t_{k-1}^{p_{k-1}} t_0^{p_0} t_1^{p_1} \cdots t_{i-2}^{p_{i-2}},$$

this describes the a-edge of DPillar$_{n,k}$ from $(i, p_{k-1} p_{k-2} \cdots p_0)$ to $(i-1, p_{k-1} p_{k-2} \cdots p_i (p_{i-1}+q) p_{i-2} \cdots p_0)$.
As

$$t_i^{p_i} t_{i+1}^{p_{i+1}} \cdots t_{k-1}^{p_{k-1}} t_0^{p_0} t_1^{p_1} \cdots t_{i-1}^{p_{i-1}} \circ t_0^{q} t_1^{0} \cdots t_{k-1}^{0} = t_i^{p_i+q} t_{i+1}^{p_{i+1}} \cdots t_{k-1}^{p_{k-1}} t_0^{p_0} t_1^{p_1} \cdots t_{i-1}^{p_{i-1}},$$

this describes the b-edge of DPillar$_{n,k}$ from $(i, p_{k-1} p_{k-2} \cdots p_0)$ to $(i, p_{k-1} p_{k-2} \cdots p_{i+1} (p_i+q) p_{i-1} \cdots p_0)$ when
$q > 0$. As

$$t_i^{p_i} t_{i+1}^{p_{i+1}} \cdots t_{k-1}^{p_{k-1}} t_0^{p_0} t_1^{p_1} \cdots t_{i-1}^{p_{i-1}} \circ t_1^{0} t_2^{0} \cdots t_{k-1}^{0} t_0^{q} = t_{i+1}^{p_{i+1}} \cdots t_{k-1}^{p_{k-1}} t_0^{p_0} t_1^{p_1} \cdots t_{i-1}^{p_{i-1}} t_i^{p_i+q},$$

this describes the c-edge of DPillar$_{n,k}$ from $(i, p_{k-1} p_{k-2} \cdots p_0)$ to $(i+1, p_{k-1} p_{k-2} \cdots p_{i+1} (p_i+q) p_{i-1} \cdots p_0)$.
As

$$t_i^{p_i} t_{i+1}^{p_{i+1}} \cdots t_{k-1}^{p_{k-1}} t_0^{p_0} t_1^{p_1} \cdots t_{i-1}^{p_{i-1}} \circ t_0^{0} t_1^{0} \cdots t_{k-2}^{0} t_{k-1}^{q} = t_i^{p_i} t_{i+1}^{p_{i+1}} \cdots t_{k-1}^{p_{k-1}} t_0^{p_0} t_1^{p_1} \cdots t_{i-2}^{p_{i-2}} t_{i-1}^{p_{i-1}+q},$$

this describes the d-edge of DPillar$_{n,k}$ from $(i, p_{k-1} p_{k-2} \cdots p_0)$ to $(i, p_{k-1} p_{k-2} \cdots p_i (p_{i-1}+q) p_{i-2} \cdots p_0)$ when
$q > 0$. Consequently, $\varphi$ is an isomorphism of $G_n^k(S_0)$ to DPillar$_{n,k}$ and the result follows. □
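
The construction can be checked by brute force on a small instance. The sketch below is an illustration under stated assumptions rather than part of the proof: an element $t_i^{p_i} \cdots t_{i-1}^{p_{i-1}}$ is stored as a pair `(i, p)` with `p = (p_0, ..., p_{k-1})`, and the function `multiply` uses an assumed closed form of the product, $(i, p) \circ (j, r) = (i + j, p + r \text{ shifted by } i)$, chosen so that it agrees with the four right-multiplication rules on the generators. All function names are hypothetical, and the check assumes $k \ge 3$.

```python
from itertools import product


def unit(k, pos, q):
    """Exponent vector with q at index pos and 0 elsewhere."""
    return tuple(q if a == pos else 0 for a in range(k))


def multiply(x, y, n, k):
    """Assumed closed form of the product: (i, p) o (j, r) = (i + j, p + r shifted by i)."""
    (i, p), (j, r) = x, y
    return ((i + j) % k, tuple((p[a] + r[(a - i) % k]) % (n // 2) for a in range(k)))


def generators(n, k):
    """S_0: the union of S_a, S_b, S_c and S_d without the identity."""
    s = set()
    for q in range(n // 2):
        s.add((k - 1, unit(k, k - 1, q)))  # S_a: t_{k-1}^q t_0^0 ... t_{k-2}^0
        s.add((0, unit(k, 0, q)))          # S_b: t_0^q t_1^0 ... t_{k-1}^0
        s.add((1, unit(k, 0, q)))          # S_c: t_1^0 ... t_{k-1}^0 t_0^q
        s.add((0, unit(k, k - 1, q)))      # S_d: t_0^0 ... t_{k-2}^0 t_{k-1}^q
    s.discard((0, (0,) * k))
    return s


def dpillar_neighbours(node, n, k):
    """Out-neighbours via the a-, b-, c- and d-edges described in the proof."""
    i, p = node
    out = set()
    # c- and b-edges rewrite p_i; a- and d-edges rewrite p_{i-1}.
    for pos, other_column in ((i, (i + 1) % k), ((i - 1) % k, (i - 1) % k)):
        for value in range(n // 2):
            changed = p[:pos] + (value,) + p[pos + 1:]
            out.add((other_column, changed))
            out.add((i, changed))
    out.discard(node)
    return out


def cayley_graph_matches_dpillar(n, k):
    nodes = [(i, p) for i in range(k) for p in product(range(n // 2), repeat=k)]
    s0 = generators(n, k)
    identity = (0, (0,) * k)
    closed = all(any(multiply(s, t, n, k) == identity for t in s0) for s in s0)
    same_edges = all(
        {multiply(x, s, n, k) for s in s0} == dpillar_neighbours(x, n, k) for x in nodes
    )
    return closed and same_edges and len(s0) == 2 * n - 2


print(cayley_graph_matches_dpillar(6, 3))  # True
```

For $n = 6$ and $k = 3$ the check confirms that $S_0$ has $2n - 2 = 10$ elements, is closed under inverses, and that right-multiplication by $S_0$ produces exactly the four edge types at every node.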


A.2 Modelling other network configurations: jumbo frames and 10-Gbit Ethernet

While we were unable to carry out experiments with more advanced datacenter networking equipment, such
as switches capable of dealing with jumbo frames or 10-Gbit Ethernet NICs or switches, it should be possible
to extrapolate their performance from the statistics we captured from our experimental set-up. Having
estimates for the latency expected from these configurations is useful as they can be seen as pathological
cases in relation to the performance gains inherent to DPillarMin. Using jumbo frames (that is, frames with a
payload of 9,000 bytes^, rather than the standard 1,472 bytes) means that any routing algorithm is executed
less often and that the protocol- and propagation-induced delays become less substantial when compared
with the data transmission delay. In our case, this means that the overhead due to using DPillarMin becomes
less significant; consequently, the overall delay will be even better than with standard frames. On the other
hand, the higher bandwidth of 10-Gbit (10×) equipment means that the per-hop delay will be reduced which,
in turn, means that the time taken to undertake routing computations may become dominant. However,
according to our assessment this will not be the case.

We now explain how we extrapolate the per-hop latency and server-to-server latency for these technologies
from the latencies we measured empirically in Section 7.3 (that is, $L_s$, $L_p$, $L_d$ and $L_r$).

• The stack latency, $L_s$, should not change as it is due to software executions at the server-side.

• The propagation latency, $L_p$, would barely be affected by the bandwidth of the links, or the size of the
frames, but would be affected by the length of the links or the transmission media used (copper/fibre).
For simplicity, we assume the propagation latency does not vary.

• The data transfer latency, $L_d$, depends on the transmission bandwidth and the size of the frame. For
simplicity, we assume perfect linear scaling of the per-byte delay: 26 ns per byte for 1-Gbit Ethernet
and 2.6 ns per byte for 10-Gbit Ethernet, multiplied by either 1 standard frame (1,472 bytes) or 1
jumbo frame (9,000 bytes).

• There is no change to the average routing latency, $L_r$.
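
The assumptions listed above can be sketched as a small model. This is a minimal illustration, not the authors' modelling code: the constant and function names are hypothetical, and the data transfer latency is obtained by scaling the measured 38 microseconds (one standard frame on 1-Gbit Ethernet) rather than by multiplying the rounded figure of 26 ns per byte, which is an assumption made here.

```python
# Latencies measured in Section 7.3, in microseconds.
STACK_US = 10.0
PROPAGATION_US = 22.0
TRANSFER_US_1G_STANDARD = 38.0   # one standard frame on 1-Gbit Ethernet
STANDARD_PAYLOAD = 1472          # bytes
JUMBO_PAYLOAD = 9000             # bytes


def per_hop_latency(routing_us, speedup=1, payload=STANDARD_PAYLOAD):
    """Extrapolated per-hop latency in microseconds.

    Only the data transfer latency changes: it scales linearly with the
    payload size and inversely with the link speed-up (10 for 10-Gbit).
    """
    transfer_us = TRANSFER_US_1G_STANDARD * (payload / STANDARD_PAYLOAD) / speedup
    return STACK_US + PROPAGATION_US + transfer_us + routing_us


# Routing latencies for n = 16, k = 4 from Table 5: DPillarMin and DPillarSP.
for label, speedup, payload in [
    ("10-Gbit, standard frames", 10, STANDARD_PAYLOAD),
    ("1-Gbit, jumbo frames", 1, JUMBO_PAYLOAD),
    ("10-Gbit, jumbo frames", 10, JUMBO_PAYLOAD),
]:
    hop_min = per_hop_latency(5.964, speedup, payload)
    hop_sp = per_hop_latency(1.349, speedup, payload)
    print(f"{label}: {hop_min:.2f} vs {hop_sp:.2f}")
```

For $n = 16$ and $k = 4$ this scaling gives per-hop values that agree, to two decimal places, with the corresponding entries of Tables A.1 to A.3.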

^Different networking equipment may have different frame length limits. For simplicity, we stick to a payload of 9,000 bytes,
even though some devices can handle even larger jumbo frames, e.g., Cisco devices can typically handle up to 9,216 bytes.

The extrapolation for 10-Gbit Ethernet (see Table A.1) suggests that, even though the per-hop delay
might go up significantly with faster networks, the great improvement in path length achieved by DPillarMin
still compensates for this and provides an improvement in terms of overall latency of between 20% and 23%.
The use of jumbo frames alleviates the overhead incurred by using DPillarMin (see Tables A.2 and A.3) and
raises the improvement in terms of overall latency up to 29% (1-Gbit Ethernet) and 24% (10-Gbit Ethernet).

Further informal analysis using stack and propagation delays that were one order of magnitude smaller 
than the ones obtained in our empirical testing, suggested that with standard frames the overall latency will 
still be reduced by around 20 to 23% with 1-Gbit Ethernet and between 4% and 16% with 10-Gbit Ethernet 
in most of the cases. If jumbo frames were considered then the stack and propagation delays barely affect 
the overall latency so the figures remain similar to those discussed above. 


Table A.1: Per-hop and overall latencies with 10-Gbit Ethernet and standard frames.

DPillar$_{n,k}$   $L_{hop}$ (µs)                     $L_{total}$ (µs)
 n    k           DPillarMin   DPillarSP   decl.     DPillarMin   DPillarSP   improve.
 16   4           41.76        37.15       12%       156.2        199.0       22%
 32   3           39.12        36.76       6%        111.8        144.6       23%
 48   3           39.13        36.66       7%        113.6        145.0       22%

Table A.2: Per-hop and overall latencies with 1-Gbit Ethernet and jumbo frames.

DPillar$_{n,k}$   $L_{hop}$ (µs)                     $L_{total}$ (µs)
 n    k           DPillarMin   DPillarSP   decl.     DPillarMin   DPillarSP   improve.
 16   4           270.30       265.69      2%        1011.0       1423.3      29%
 32   3           267.66       265.30      1%        764.6        1043.5      27%
 48   3           267.67       265.20      1%        777.3        1049.3      26%


Table A.3: Per-hop and overall latencies with 10-Gbit Ethernet and jumbo frames.

DPillar$_{n,k}$   $L_{hop}$ (µs)                     $L_{total}$ (µs)
 n    k           DPillarMin   DPillarSP   decl.     DPillarMin   DPillarSP   improve.
 16   4           61.20        56.58       8%        228.9        303.1       24%
 32   3           58.56        56.19       4%        167.3        221.0       24%
 48   3           58.56        56.09       4%        170.1        221.9       23%